Preface


These class notes are the currently used textbook for “Probabilistic Systems 
Analysis,” an introductory probability course at the Massachusetts Institute of 
Technology. The text of the notes is quite polished and complete, but the prob- 
lems are less so. 

The course is attended by a large number of undergraduate and graduate 
students with diverse backgrounds. Accordingly, we have tried to strike a balance
between simplicity in exposition and sophistication in analytical reasoning.
Some of the more mathematically rigorous analysis has been just sketched or 
intuitively explained in the text, so that complex proofs do not stand in the way 
of an otherwise simple exposition. At the same time, some of this analysis and 
the necessary mathematical results are developed (at the level of advanced calcu- 
lus) in theoretical problems, which are included at the end of the corresponding 
chapter. The theoretical problems (marked by *) constitute an important com- 
ponent of the text, and ensure that the mathematically oriented reader will find 
here a smooth development without major gaps. 

We give solutions to all the problems, aiming to enhance the utility of 
the notes for self-study. We have additional problems, suitable for homework 
assignment (with solutions), which we make available to instructors. 

Our intent is to gradually improve and eventually publish the notes as a 
textbook, and your comments will be appreciated.


Dimitri P. Bertsekas 
bertsekas@lids.mit.edu 


John N. Tsitsiklis 
jnt@mit.edu
Sample Space and Probability (Chap. 1)


“Probability” is a very useful concept, but can be interpreted in a number of 
ways. As an illustration, consider the following. 


A patient is admitted to the hospital and a potentially life-saving drug is 
administered. The following dialog takes place between the nurse and a 
concerned relative. 


RELATIVE: Nurse, what is the probability that the drug will work? 
NURSE: I hope it works, we’ll know tomorrow. 

RELATIVE: Yes, but what is the probability that it will? 

NURSE: Each case is different, we have to wait. 

RELATIVE: But let’s see, out of a hundred patients that are treated under 
similar conditions, how many times would you expect it to work? 

NURSE (somewhat annoyed): I told you, every person is different, for some 
it works, for some it doesn’t. 

RELATIVE (insisting): Then tell me, if you had to bet whether it will work 
or not, which side of the bet would you take? 

NURSE (cheering up for a moment): I'd bet it will work. 

RELATIVE (somewhat relieved): OK, now, would you be willing to lose 
two dollars if it doesn’t work, and gain one dollar if it does? 

NURSE (exasperated): What a sick thought! You are wasting my time! 


In this conversation, the relative attempts to use the concept of probability to 
discuss an uncertain situation. The nurse’s initial response indicates that the 
meaning of “probability” is not uniformly shared or understood, and the relative 
tries to make it more concrete. The first approach is to define probability in 
terms of frequency of occurrence, as a percentage of successes in a moderately 
large number of similar situations. Such an interpretation is often natural. For 
example, when we say that a perfectly manufactured coin lands on heads “with 
probability 50%,” we typically mean “roughly half of the time.” But the nurse 
may not be entirely wrong in refusing to discuss in such terms. What if this 
was an experimental drug that was administered for the very first time in this 
hospital or in the nurse’s experience? 

While there are many situations involving uncertainty in which the fre- 
quency interpretation is appropriate, there are other situations in which it is 
not. Consider, for example, a scholar who asserts that the Iliad and the Odyssey 
were composed by the same person, with probability 90%. Such an assertion 
conveys some information, but not in terms of frequencies, since the subject is 
a one-time event. Rather, it is an expression of the scholar’s subjective be- 
lief. One might think that subjective beliefs are not interesting, at least from a 
mathematical or scientific point of view. On the other hand, people often have 
to make choices in the presence of uncertainty, and a systematic way of making 
use of their beliefs is a prerequisite for successful, or at least consistent, decision
making.

In fact, the choices and actions of a rational person can reveal a lot about
the inner-held subjective probabilities, even if the person does not make conscious 
use of probabilistic reasoning. Indeed, the last part of the earlier dialog was an 
attempt to infer the nurse’s beliefs in an indirect manner. Since the nurse was 
willing to accept a one-for-one bet that the drug would work, we may infer 
that the probability of success was judged to be at least 50%. And had the 
nurse accepted the last proposed bet (two-for-one), that would have indicated a 
success probability of at least 2/3. 
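
The two thresholds can be checked with a short calculation. This is only a sketch: it assumes (the dialog does not say so) that a bet is acceptable exactly when its average gain is nonnegative, and it writes p, a symbol introduced here, for the nurse's subjective probability that the drug works.

```latex
\begin{align*}
\text{one-for-one bet:}\quad & p \cdot 1 - (1-p)\cdot 1 \ge 0 \iff p \ge \tfrac{1}{2},\\
\text{two-for-one bet:}\quad & p \cdot 1 - (1-p)\cdot 2 \ge 0 \iff p \ge \tfrac{2}{3}.
\end{align*}
```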

Rather than delving further into philosophical issues about the appropriateness
of probabilistic reasoning, we will simply take it as a given that the theory
of probability is useful in a broad variety of contexts, including some where the 
assumed probabilities only reflect subjective beliefs. There is a large body of 
successful applications in science, engineering, medicine, management, etc., and 
on the basis of this empirical evidence, probability theory is an extremely useful 
tool. 

Our main objective in this book is to develop the art of describing un- 
certainty in terms of probabilistic models, as well as the skill of probabilistic 
reasoning. The first step, which is the subject of this chapter, is to describe 
the generic structure of such models, and their basic properties. The models we 
consider assign probabilities to collections (sets) of possible outcomes. For this 
reason, we must begin with a short review of set theory. 


SETS 


Probability makes extensive use of set operations, so let us introduce at the 
outset the relevant notation and terminology. 

A set is a collection of objects, which are the elements of the set. If S is
a set and x is an element of S, we write $x \in S$. If x is not an element of S, we
write $x \notin S$. A set can have no elements, in which case it is called the empty
set, denoted by $\emptyset$.

Sets can be specified in a variety of ways. If S contains a finite number of
elements, say $x_1, x_2, \ldots, x_n$, we write it as a list of the elements, in braces:


$$S = \{x_1, x_2, \ldots, x_n\}.$$


For example, the set of possible outcomes of a die roll is {1, 2, 3, 4, 5, 6}, and the
set of possible outcomes of a coin toss is {H,T}, where H stands for “heads” 
and T stands for “tails.” 

If S contains infinitely many elements $x_1, x_2, \ldots$, which can be enumerated
in a list (so that there are as many elements as there are positive integers), we
write


$$S = \{x_1, x_2, \ldots\},$$

and we say that S is countably infinite. For example, the set of even integers
can be written as $\{0, 2, -2, 4, -4, \ldots\}$, and is countably infinite.

Alternatively, we can consider the set of all x that have a certain property
P, and denote it by

$$\{x \mid x \text{ satisfies } P\}.$$


(The symbol “|” is to be read as “such that.”) For example, the set of even
integers can be written as $\{k \mid k/2 \text{ is integer}\}$. Similarly, the set of all scalars x
in the interval [0,1] can be written as $\{x \mid 0 \le x \le 1\}$. Note that the elements x
of the latter set take a continuous range of values, and cannot be written down 
in a list (a proof is sketched in the theoretical problems); such a set is said to be 
uncountable. 

If every element of a set S is also an element of a set T, we say that S
is a subset of T, and we write $S \subset T$ or $T \supset S$. If $S \subset T$ and $T \subset S$, the
two sets are equal, and we write S = T. It is also expedient to introduce a
universal set, denoted by $\Omega$, which contains all objects that could conceivably
be of interest in a particular context. Having specified the context in terms of a
universal set $\Omega$, we only consider sets S that are subsets of $\Omega$.


Set Operations 


The complement of a set S, with respect to the universe $\Omega$, is the set
$\{x \in \Omega \mid x \notin S\}$ of all elements of $\Omega$ that do not belong to S, and is denoted by $S^c$.
Note that $\Omega^c = \emptyset$.

The union of two sets S and T is the set of all elements that belong to S
or T (or both), and is denoted by $S \cup T$. The intersection of two sets S and T
is the set of all elements that belong to both S and T, and is denoted by $S \cap T$.
Thus,

$$S \cup T = \{x \mid x \in S \text{ or } x \in T\},$$

$$S \cap T = \{x \mid x \in S \text{ and } x \in T\}.$$

In some cases, we will have to consider the union or the intersection of several,
even infinitely many sets, defined in the obvious way. For example, if for every
positive integer n, we are given a set $S_n$, then

$$\bigcup_{n=1}^{\infty} S_n = S_1 \cup S_2 \cup \cdots = \{x \mid x \in S_n \text{ for some } n\},$$

and

$$\bigcap_{n=1}^{\infty} S_n = S_1 \cap S_2 \cap \cdots = \{x \mid x \in S_n \text{ for all } n\}.$$


Two sets are said to be disjoint if their intersection is empty. More generally, 
several sets are said to be disjoint if no two of them have a common element. A 
collection of sets is said to be a partition of a set S if the sets in the collection 
are disjoint and their union is S. 
If x and y are two objects, we use (x, y) to denote the ordered pair of x
and y. The set of scalars (real numbers) is denoted by $\Re$; the set of pairs (or
triplets) of scalars, i.e., the two-dimensional plane (or three-dimensional space,
respectively) is denoted by $\Re^2$ (or $\Re^3$, respectively).

Sets and the associated operations are easy to visualize in terms of Venn 
diagrams, as illustrated in Fig. 1.1. 


Figure 1.1: Examples of Venn diagrams. (a) The shaded region is $S \cap T$. (b)
The shaded region is $S \cup T$. (c) The shaded region is $S \cap T^c$. (d) Here, $T \subset S$.
The shaded region is the complement of S. (e) The sets S, T, and U are disjoint.
(f) The sets S, T, and U form a partition of the set $\Omega$.


The Algebra of Sets 


Set operations have several properties, which are elementary consequences of the 
definitions. Some examples are: 


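$$\begin{array}{ll}
S \cup T = T \cup S, & S \cup (T \cup U) = (S \cup T) \cup U, \\
S \cap (T \cup U) = (S \cap T) \cup (S \cap U), & S \cup (T \cap U) = (S \cup T) \cap (S \cup U), \\
(S^c)^c = S, & S \cap S^c = \emptyset, \\
S \cup \Omega = \Omega, & S \cap \Omega = S.
\end{array}$$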


Two particularly useful properties are given by de Morgan’s laws, which
state that

$$\Bigl(\bigcup_n S_n\Bigr)^c = \bigcap_n S_n^c, \qquad \Bigl(\bigcap_n S_n\Bigr)^c = \bigcup_n S_n^c.$$


To establish the first law, suppose that $x \in (\cup_n S_n)^c$. Then, $x \notin \cup_n S_n$, which
implies that for every n, we have $x \notin S_n$. Thus, x belongs to the complement
of every $S_n$, and $x \in \cap_n S_n^c$. This shows that $(\cup_n S_n)^c \subset \cap_n S_n^c$. The converse
inclusion is established by reversing the above argument, and the first law follows.
The argument for the second law is similar.
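
As a sanity check rather than a proof, de Morgan's laws can be verified on a small finite example. The sketch below uses Python's built-in sets; the universal set `omega` and the family of subsets are arbitrary illustrative choices, not taken from the text.

```python
# Universal set and a small family of subsets (illustrative values only).
omega = set(range(1, 11))
family = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]

def complement(s, universe=omega):
    """Complement of s with respect to the universal set."""
    return universe - s

union_all = set().union(*family)
intersection_all = set(omega).intersection(*family)

# First law: the complement of a union is the intersection of the complements.
first_law = complement(union_all) == set(omega).intersection(
    *(complement(s) for s in family)
)
# Second law: the complement of an intersection is the union of the complements.
second_law = complement(intersection_all) == set().union(
    *(complement(s) for s in family)
)

print(first_law, second_law)
```

A check of this kind only confirms the laws for one particular family of sets; the general statement still rests on the argument given above.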


PROBABILISTIC MODELS 


A probabilistic model is a mathematical description of an uncertain situation. 
It must be in accordance with a fundamental framework that we discuss in this 
section. Its two main ingredients are listed below and are visualized in Fig. 1.2. 


Elements of a Probabilistic Model 


• The sample space $\Omega$, which is the set of all possible outcomes of an
experiment.


• The probability law, which assigns to a set A of possible outcomes
(also called an event) a nonnegative number P(A) (called the probability
of A) that encodes our knowledge or belief about the collective
“likelihood” of the elements of A. The probability law must satisfy
certain properties to be introduced shortly.


Figure 1.2: The main ingredients of a probabilistic model. (Figure labels: sample space $\Omega$, the set of outcomes; events.)


Sample Spaces and Events 


Every probabilistic model involves an underlying process, called the experi- 
ment, that will produce exactly one out of several possible outcomes. The set 
of all possible outcomes is called the sample space of the experiment, and is 
denoted by $\Omega$. A subset of the sample space, that is, a collection of possible
outcomes, is called an event.† There is no restriction on what constitutes an
experiment. For example, it could be a single toss of a coin, or three tosses, 
or an infinite sequence of tosses. However, it is important to note that in our 
formulation of a probabilistic model, there is only one experiment. So, three 
tosses of a coin constitute a single experiment, rather than three experiments. 

The sample space of an experiment may consist of a finite or an infinite 
number of possible outcomes. Finite sample spaces are conceptually and math- 
ematically simpler. Still, sample spaces with an infinite number of elements are 
quite common. For an example, consider throwing a dart on a square target and 
viewing the point of impact as the outcome. 


Choosing an Appropriate Sample Space 


Regardless of their number, different elements of the sample space should be 
distinct and mutually exclusive so that when the experiment is carried out, 
there is a unique outcome. For example, the sample space associated with the 
roll of a die cannot contain “1 or 3” as a possible outcome and also “1 or 4” as 
another possible outcome. When the roll is a 1, the outcome of the experiment 
would not be unique. 

A given physical situation may be modeled in several different ways, de- 
pending on the kind of questions that we are interested in. Generally, the sample 
space chosen for a probabilistic model must be collectively exhaustive, in the 
sense that no matter what happens in the experiment, we always obtain an out- 
come that has been included in the sample space. In addition, the sample space 
should have enough detail to distinguish between all outcomes of interest to the 
modeler, while avoiding irrelevant details. 


Example 1.1. Consider two alternative games, both involving ten successive coin 
tosses: 


Game 1: We receive $1 each time a head comes up. 


Game 2: We receive $1 for every coin toss, up to and including the first time 
a head comes up. Then, we receive $2 for every coin toss, up to the second 
time a head comes up. More generally, the dollar amount per toss is doubled 
each time a head comes up. 


† Any collection of possible outcomes, including the entire sample space $\Omega$ and
its complement, the empty set $\emptyset$, may qualify as an event. Strictly speaking, however,
some sets have to be excluded. In particular, when dealing with probabilistic models
involving an uncountably infinite sample space, there are certain unusual subsets for
which one cannot associate meaningful probabilities. This is an intricate technical issue,
involving the mathematics of measure theory. Fortunately, such pathological subsets
do not arise in the problems considered in this text or in practice, and the issue can be
safely ignored.
In game 1, it is only the total number of heads in the ten-toss sequence that matters,
while in game 2, the order of heads and tails is also important. Thus, in
a probabilistic model for game 1, we can work with a sample space consisting of 
eleven possible outcomes, namely, 0,1,...,10. In game 2, a finer grain description 
of the experiment is called for, and it is more appropriate to let the sample space 
consist of every possible ten-long sequence of heads and tails. 
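
The difference between the two sample spaces can be illustrated with a short sketch. The function names are hypothetical, and the payoff rule for game 2 reflects one reading of the description: the toss on which a head appears is still paid at the current rate, and the rate doubles from the next toss onward.

```python
from itertools import product

def payoff_game1(tosses):
    """Game 1: $1 for each head."""
    return tosses.count("H")

def payoff_game2(tosses):
    """Game 2: the per-toss amount starts at $1 and doubles after each head.

    Assumption: the toss on which a head appears is paid at the old rate;
    the doubled rate applies from the next toss onward.
    """
    total, rate = 0, 1
    for t in tosses:
        total += rate
        if t == "H":
            rate *= 2
    return total

# Fine-grained sample space: every ten-long sequence of heads and tails.
sequences = ["".join(s) for s in product("HT", repeat=10)]
print(len(sequences))

# Two sequences with the same number of heads but a different order.
a, b = "HTTTTTTTTT", "TTTTTTTTTH"
print(payoff_game1(a), payoff_game1(b))
print(payoff_game2(a), payoff_game2(b))
```

Both sequences contain one head, so game 1 cannot tell them apart, whereas the game 2 payoffs differ. This is why the eleven-outcome sample space suffices for game 1 but not for game 2.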


Sequential Models 


Many experiments have an inherently sequential character, such as for example 
tossing a coin three times, or observing the value of a stock on five successive 
days, or receiving eight successive digits at a communication receiver. It is then 
often useful to describe the experiment and the associated sample space by means 
of a tree-based sequential description, as in Fig. 1.3. 


[Figure 1.3: left panel, "Sample Space: Pair of Rolls," a grid of outcomes indexed by the 1st roll; right panel, "Sequential Tree Description," with the root and the leaves marked.]

Figure 1.3: Two equivalent descriptions of the sample space of an experiment 
involving two rolls of a 4-sided die. The possible outcomes are all the ordered pairs 
of the form (i, j), where i is the result of the first roll, and j is the result of the
second. These outcomes can be arranged in a 2-dimensional grid as in the figure 
on the left, or they can be described by the tree on the right, which reflects the 
sequential character of the experiment. Here, each possible outcome corresponds 
to a leaf of the tree and is associated with the unique path from the root to that 
leaf. The shaded area on the left is the event {(1,4), (2,4), (3,4), (4,4)} that the 
result of the second roll is 4. That same event can be described as a set of leaves, 
as shown on the right. Note also that every node of the tree can be identified with 
an event, namely, the set of all leaves downstream from that node. For example, 
the node labeled by a 1 can be identified with the event {(1, 1), (1, 2), (1,3), (1, 4)} 
that the result of the first roll is 1. 


Probability Laws 


Suppose we have settled on the sample space $\Omega$ associated with an experiment.
Then, to complete the probabilistic model, we must introduce a probability
law. Intuitively, this specifies the “likelihood” of any outcome, or of any set of 
possible outcomes (an event, as we have called it earlier). More precisely, the 
probability law assigns to every event A, a number P(A), called the probability 
of A, satisfying the following axioms. 


Probability Axioms 


1. (Nonnegativity) $P(A) \ge 0$, for every event A.


2. (Additivity) If A and B are two disjoint events, then the probability
of their union satisfies

$$P(A \cup B) = P(A) + P(B).$$

Furthermore, if the sample space has an infinite number of elements
and $A_1, A_2, \ldots$ is a sequence of disjoint events, then the probability of
their union satisfies

$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$$


3. (Normalization) The probability of the entire sample space $\Omega$ is
equal to 1, that is, $P(\Omega) = 1$.


In order to visualize a probability law, consider a unit of mass which is 
to be “spread” over the sample space. Then, P(A) is simply the total mass 
that was assigned collectively to the elements of A. In terms of this analogy, the 
additivity axiom becomes quite intuitive: the total mass in a sequence of disjoint 
events is the sum of their individual masses. 

A more concrete interpretation of probabilities is in terms of relative fre- 
quencies: a statement such as P(A) = 2/3 often represents a belief that event A 
will materialize in about two thirds out of a large number of repetitions of the 
experiment. Such an interpretation, though not always appropriate, can some- 
times facilitate our intuitive understanding. It will be revisited in Chapter 7, in 
our study of limit theorems. 

There are many natural properties of a probability law which have not been 
included in the above axioms for the simple reason that they can be derived 
from them. For example, note that the normalization and additivity axioms 
imply that 


$$1 = P(\Omega) = P(\Omega \cup \emptyset) = P(\Omega) + P(\emptyset) = 1 + P(\emptyset),$$

and this shows that the probability of the empty event is 0:

$$P(\emptyset) = 0.$$
As another example, consider three disjoint events $A_1$, $A_2$, and $A_3$. We can use
the additivity axiom for two disjoint events repeatedly, to obtain

$$\begin{aligned}
P(A_1 \cup A_2 \cup A_3) &= P\bigl(A_1 \cup (A_2 \cup A_3)\bigr) \\
&= P(A_1) + P(A_2 \cup A_3) \\
&= P(A_1) + P(A_2) + P(A_3).
\end{aligned}$$


Proceeding similarly, we obtain that the probability of the union of finitely many 
disjoint events is always equal to the sum of the probabilities of these events. 
More such properties will be considered shortly. 


Discrete Models 


Here is an illustration of how to construct a probability law starting from some 
common sense assumptions about a model. 


Example 1.2. Coin tosses. Consider an experiment involving a single coin 
toss. There are two possible outcomes, heads (H) and tails (T). The sample space
is $\Omega = \{H, T\}$, and the events are

$$\{H, T\},\ \{H\},\ \{T\},\ \emptyset.$$

If the coin is fair, i.e., if we believe that heads and tails are “equally likely,” we
should assign equal probabilities to the two possible outcomes and specify that
P({H}) = P({T}) = 0.5. The additivity axiom implies that


P({H,T}) = P({H}) + P({T}) =1, 


which is consistent with the normalization axiom. Thus, the probability law is given 
by 


$$P(\{H, T\}) = 1, \quad P(\{H\}) = 0.5, \quad P(\{T\}) = 0.5, \quad P(\emptyset) = 0,$$
and satisfies all three axioms. 
Consider another experiment involving three coin tosses. The outcome will 
now be a 3-long string of heads or tails. The sample space is 
$$\Omega = \{HHH,\ HHT,\ HTH,\ HTT,\ THH,\ THT,\ TTH,\ TTT\}.$$

We assume that each possible outcome has the same probability of 1/8. Let us
construct a probability law that satisfies the three axioms. Consider, as an example,
the event

$$A = \{\text{exactly 2 heads occur}\} = \{HHT,\ HTH,\ THH\}.$$
Using additivity, the probability of A is the sum of the probabilities of its elements:

$$\begin{aligned}
P(\{HHT, HTH, THH\}) &= P(\{HHT\}) + P(\{HTH\}) + P(\{THH\}) \\
&= \frac{1}{8} + \frac{1}{8} + \frac{1}{8} \\
&= \frac{3}{8}.
\end{aligned}$$


Similarly, the probability of any event is equal to 1/8 times the number of possible 
outcomes contained in the event. This defines a probability law that satisfies the 
three axioms. 
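
The construction in this example can be reproduced in a few lines. The following is a minimal sketch, with names such as `p_outcome` and `prob` chosen here for illustration; it stores the probability of each single-element event and obtains the probability of any other event by additivity.

```python
from fractions import Fraction
from itertools import product

# Sample space for three coin tosses: 8 strings such as "HHT".
omega = ["".join(s) for s in product("HT", repeat=3)]
# Each outcome is assumed to have the same probability, 1/8.
p_outcome = {w: Fraction(1, len(omega)) for w in omega}

def prob(event):
    """Probability of an event (a collection of outcomes), by additivity."""
    return sum((p_outcome[w] for w in event), Fraction(0))

exactly_two_heads = {w for w in omega if w.count("H") == 2}
print(sorted(exactly_two_heads), prob(exactly_two_heads))
print(prob(omega), prob(set()))
```

Exact fractions are used instead of floating-point numbers so that the normalization check, P(Ω) = 1, holds exactly.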


By using the additivity axiom and by generalizing the reasoning in the 
preceding example, we reach the following conclusion. 


Discrete Probability Law 


If the sample space consists of a finite number of possible outcomes, then the 
probability law is specified by the probabilities of the events that consist of 
a single element. In particular, the probability of any event $\{s_1, s_2, \ldots, s_n\}$
is the sum of the probabilities of its elements:


$$P(\{s_1, s_2, \ldots, s_n\}) = P(\{s_1\}) + P(\{s_2\}) + \cdots + P(\{s_n\}).$$


In the special case where the probabilities P({s1}),...,P({sn}) are all the 
same (by necessity equal to 1/n, in view of the normalization axiom), we obtain 
the following. 


Discrete Uniform Probability Law 

If the sample space consists of n possible outcomes which are equally likely 
(i.e., all single-element events have the same probability), then the proba- 
bility of any event A is given by 


$$P(A) = \frac{\text{Number of elements of } A}{n}.$$


Let us provide a few more examples of sample spaces and probability laws. 


Example 1.3. Dice. Consider the experiment of rolling a pair of 4-sided dice (cf. 
Fig. 1.4). We assume the dice are fair, and we interpret this assumption to mean
that each of the sixteen possible outcomes [ordered pairs (i, j), with i, j = 1, 2, 3, 4],
has the same probability of 1/16. To calculate the probability of an event, we
must count the number of elements of the event and divide by 16 (the total number of
possible outcomes). Here are some event probabilities calculated in this way:


[Figure 1.4 panel: "Sample Space: Pair of Rolls," a grid of outcomes indexed by the 1st roll, with two events marked.]

Event {at least one roll is a 4}: probability = 7/16.

Event {the first roll is equal to the second}: probability = 4/16.


Figure 1.4: Various events in the experiment of rolling a pair of 4-sided dice, 
and their probabilities, calculated according to the discrete uniform law. 
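
The counting behind the discrete uniform law is easy to mechanize. In the sketch below, the helper `prob` and the way events are described (as true/false conditions on an outcome) are illustrative choices made here, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for a pair of 4-sided dice: 16 ordered pairs (i, j).
omega = list(product(range(1, 5), repeat=2))

def prob(predicate):
    """Discrete uniform law: (number of outcomes in the event) / 16."""
    favorable = [w for w in omega if predicate(w)]
    return Fraction(len(favorable), len(omega))

p_at_least_one_4 = prob(lambda w: 4 in w)
p_equal = prob(lambda w: w[0] == w[1])
p_second_is_4 = prob(lambda w: w[1] == 4)
print(p_at_least_one_4, p_equal, p_second_is_4)
```

Note that `Fraction` reduces its result to lowest terms, so 4/16 is displayed as 1/4.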


Continuous Models 


Probabilistic models with continuous sample spaces differ from their discrete 
counterparts in that the probabilities of the single-element events may not be 
sufficient to characterize the probability law. This is illustrated in the following 
examples, which also illustrate how to generalize the uniform probability law to 
the case of a continuous sample space. 
Example 1.4. A wheel of fortune is continuously calibrated from 0 to 1, so the
possible outcomes of an experiment consisting of a single spin are the numbers in
the interval $\Omega = [0, 1]$. Assuming a fair wheel, it is appropriate to consider all
outcomes equally likely, but what is the probability of the event consisting of a 
single element? It cannot be positive, because then, using the additivity axiom, it 
would follow that events with a sufficiently large number of elements would have 
probability larger than 1. Therefore, the probability of any event that consists of a 
single element must be 0. 

In this example, it makes sense to assign probability $b - a$ to any subinterval
[a,b] of [0,1], and to calculate the probability of a more complicated set by evaluating
its “length.”† This assignment satisfies the three probability axioms and
qualifies as a legitimate probability law.


Example 1.5. Romeo and Juliet have a date at a given time, and each will arrive 
at the meeting place with a delay between 0 and 1 hour, with all pairs of delays 
being equally likely. The first to arrive will wait for 15 minutes and will leave if the 
other has not yet arrived. What is the probability that they will meet? 

Let us use as sample space the square $\Omega = [0,1] \times [0,1]$, whose elements are
the possible pairs of delays for the two of them. Our interpretation of “equally
likely” pairs of delays is to let the probability of a subset of $\Omega$ be equal to its area.
This probability law satisfies the three probability axioms. The event that Romeo 
and Juliet will meet is the shaded region in Fig. 1.5, and its probability is calculated 
to be 7/16. 
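
The value 7/16 can be cross-checked numerically. The sketch below is a simulation, not part of the derivation: it leans on the relative frequency interpretation, draws pairs of delays uniformly from the unit square with Python's standard `random` module, and counts how often the two delays differ by at most 1/4. The function name, sample size, and seed are arbitrary choices made here.

```python
import random

def estimate_meeting_probability(n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(|x - y| <= 1/4) for (x, y) uniform on the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if abs(x - y) <= 0.25:
            hits += 1
    return hits / n_samples

estimate = estimate_meeting_probability()
exact = 1 - (3 / 4) * (3 / 4)  # area of the square minus the two triangles
print(estimate, exact)
```

The estimate fluctuates with the seed and the sample size, so it should only be expected to agree with the exact area to within a small error.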


Properties of Probability Laws 


Probability laws have a number of properties, which can be deduced from the 
axioms. Some of them are summarized below. 


Some Properties of Probability Laws 

Consider a probability law, and let A, B, and C be events. 
(a) If $A \subset B$, then $P(A) \le P(B)$.

(b) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

(c) $P(A \cup B) \le P(A) + P(B)$.

(d) $P(A \cup B \cup C) = P(A) + P(A^c \cap B) + P(A^c \cap B^c \cap C)$.


† The “length” of a subset S of [0,1] is the integral $\int_S dt$, which is defined, for
“nice” sets S, in the usual calculus sense. For unusual sets, this integral may not be
well defined mathematically, but such issues belong to a more advanced treatment of
the subject.
Figure 1.5: The event M that Romeo and Juliet will arrive within 15 minutes
of each other (cf. Example 1.5) is

$$M = \bigl\{(x, y) \bigm| |x - y| \le 1/4,\ 0 \le x \le 1,\ 0 \le y \le 1\bigr\},$$

and is shaded in the figure. The area of M is 1 minus the area of the two unshaded
triangles, or $1 - (3/4) \cdot (3/4) = 7/16$. Thus, the probability of meeting is 7/16.


These properties, and other similar ones, can be visualized and verified 
graphically using Venn diagrams, as in Fig. 1.6. For a further example, note 
that we can apply property (c) repeatedly and obtain the inequality 


$$P(A_1 \cup A_2 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} P(A_i).$$

In more detail, let us apply property (c) to the sets $A_1$ and $A_2 \cup \cdots \cup A_n$, to
obtain

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) \le P(A_1) + P(A_2 \cup \cdots \cup A_n).$$

We also apply property (c) to the sets $A_2$ and $A_3 \cup \cdots \cup A_n$ to obtain

$$P(A_2 \cup \cdots \cup A_n) \le P(A_2) + P(A_3 \cup \cdots \cup A_n),$$

continue similarly, and finally add.
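
Properties (a) through (d) and the inequality for several events can be spot-checked on a concrete model. The sketch below reuses the pair of fair 4-sided dice with the discrete uniform law; the events A, B, and C are arbitrary illustrative choices. In the code, `|` and `&` are Python's set union and intersection operators.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 5), repeat=2))  # pair of 4-sided dice

def P(event):
    """Discrete uniform law on omega."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 4}     # first roll is 4
B = {w for w in omega if w[1] == 4}     # second roll is 4
C = {w for w in omega if w[0] == w[1]}  # the two rolls are equal
Ac, Bc = omega - A, omega - B           # complements of A and B

prop_a = P(A & B) <= P(A)                               # A ∩ B is a subset of A
prop_b = P(A | B) == P(A) + P(B) - P(A & B)
prop_c = P(A | B) <= P(A) + P(B)
prop_d = P(A | B | C) == P(A) + P(Ac & B) + P(Ac & Bc & C)
union_bound = P(A | B | C) <= P(A) + P(B) + P(C)
print(prop_a, prop_b, prop_c, prop_d, union_bound)
```

Such a check confirms the properties for one model and one choice of events only; their general validity follows from the axioms.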


Models and Reality 


Using the framework of probability theory to analyze a physical but uncertain
situation involves two distinct stages.


(a) In the first stage, we construct a probabilistic model, by specifying a prob- 
ability law on a suitably defined sample space. There are no hard rules to 
Figure 1.6: Visualization and verification of various properties of probability
laws using Venn diagrams. If $A \subset B$, then B is the union of the two disjoint
events A and $A^c \cap B$; see diagram (a). Therefore, by the additivity axiom, we
have

$$P(B) = P(A) + P(A^c \cap B) \ge P(A),$$

where the inequality follows from the nonnegativity axiom, and verifies property (a).

From diagram (b), we can express the events $A \cup B$ and B as unions of
disjoint events:

$$A \cup B = A \cup (A^c \cap B), \qquad B = (A \cap B) \cup (A^c \cap B).$$

The additivity axiom yields

$$P(A \cup B) = P(A) + P(A^c \cap B), \qquad P(B) = P(A \cap B) + P(A^c \cap B).$$

Subtracting the second equality from the first and rearranging terms, we obtain
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$, verifying property (b). Using also the fact
$P(A \cap B) \ge 0$ (the nonnegativity axiom), we obtain $P(A \cup B) \le P(A) + P(B)$,
verifying property (c).

From diagram (c), we see that the event $A \cup B \cup C$ can be expressed as a
union of three disjoint events:

$$A \cup B \cup C = A \cup (A^c \cap B) \cup (A^c \cap B^c \cap C),$$

so property (d) follows as a consequence of the additivity axiom.


guide this step, other than the requirement that the probability law conform
to the three axioms. Reasonable people may disagree on which model
best represents reality. In many cases, one may even want to use a some- 
what “incorrect” model, if it is simpler than the “correct” one or allows for 
tractable calculations. This is consistent with common practice in science 
and engineering, where the choice of a model often involves a tradeoff be- 
tween accuracy, simplicity, and tractability. Sometimes, a model is chosen 
on the basis of historical data or past outcomes of similar experiments. 
Systematic methods for doing so belong to the field of statistics, a topic 
that we will touch upon in the last chapter of this book. 


(b) In the second stage, we work within a fully specified probabilistic model and
derive the probabilities of certain events, or deduce some interesting prop- 
erties. While the first stage entails the often open-ended task of connecting 
the real world with mathematics, the second one is tightly regulated by the 
rules of ordinary logic and the axioms of probability. Difficulties may arise 
in the latter if some required calculations are complex, or if a probability 
law is specified in an indirect fashion. Even so, there is no room for ambi- 
guity: all conceivable questions have precise answers and it is only a matter 
of developing the skill to arrive at them. 


Probability theory is full of “paradoxes” in which different calculation
methods seem to give different answers to the same question. Invariably though,
these apparent inconsistencies turn out to reflect poorly specified or ambiguous
probabilistic models.


CONDITIONAL PROBABILITY 


Conditional probability provides us with a way to reason about the outcome
of an experiment, based on partial information. Here are some examples of
situations we have in mind:


(a) In an experiment involving two successive rolls of a die, you are told that
the sum of the two rolls is 9. How likely is it that the first roll was a 6?


(b) In a word guessing game, the first letter of the word is a “t”. What is the
likelihood that the second letter is an “h”?


(c) How likely is it that a person has a disease given that a medical test was
negative?


(d) A spot shows up on a radar screen. How likely is it that it corresponds to
an aircraft?


In more precise terms, given an experiment, a corresponding sample space,
and a probability law, suppose that we know that the outcome is within some
given event B. We wish to quantify the likelihood that the outcome also belongs
to some other given event A. We thus seek to construct a new probability law,
which takes into account this knowledge and which, for any event A, gives us 
the conditional probability of A given B, denoted by P(A| B). 

We would like the conditional probabilities P(A|B) of different events A 
to constitute a legitimate probability law, that satisfies the probability axioms. 
They should also be consistent with our intuition in important special cases, e.g., 
when all possible outcomes of the experiment are equally likely. For example, 
suppose that all six possible outcomes of a fair die roll are equally likely. If we 
are told that the outcome is even, we are left with only three possible outcomes, 
namely, 2, 4, and 6. These three outcomes were equally likely to start with, 
and so they should remain equally likely given the additional knowledge that the 
outcome was even. Thus, it is reasonable to let 


$$P(\text{the outcome is 6} \mid \text{the outcome is even}) = \frac{1}{3}.$$


This argument suggests that an appropriate definition of conditional probability
when all outcomes are equally likely is given by

$$P(A \mid B) = \frac{\text{number of elements of } A \cap B}{\text{number of elements of } B}.$$


Generalizing the argument, we introduce the following definition of
conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$


where we assume that P(B) > 0; the conditional probability is undefined if the 
conditioning event has zero probability. In words, out of the total probability of 
the elements of B, P(A| B) is the fraction that is assigned to possible outcomes 
that also belong to A. 
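
The definition can be tried out numerically on a finite sample space. The following is a minimal Python sketch, not part of the original notes; the helper names `prob` and `cond_prob` and the dictionary `die_law` are illustrative assumptions. It represents events as sets of outcomes and reproduces the fair die calculation above.

```python
from fractions import Fraction


def prob(event, law):
    """Probability of an event (a set of outcomes) under a finite law."""
    return sum((law[outcome] for outcome in event), Fraction(0))


def cond_prob(a, b, law):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    p_b = prob(b, law)
    if p_b == 0:
        raise ValueError("P(B) = 0: the conditional probability is undefined")
    return prob(a & b, law) / p_b


# Fair six-sided die: each face has probability 1/6 (assumed model).
die_law = {face: Fraction(1, 6) for face in range(1, 7)}
even = {2, 4, 6}
six = {6}

p_six_given_even = cond_prob(six, even, die_law)
print(p_six_given_even)  # 1/3
```

Exact fractions are used instead of floating point so that the result can be compared directly with the value 1/3 derived in the text.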


Conditional Probabilities Specify a Probability Law 


For a fixed event B, it can be verified that the conditional probabilities P(A | B) 
form a legitimate probability law that satisfies the three axioms. Indeed, non- 
negativity is clear. Furthermore, 


$$P(\Omega \mid B) = \frac{P(\Omega \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1,$$


and the normalization axiom is also satisfied. In fact, since we have P(B| B) = 
P(B)/P(B) = 1, all of the conditional probability is concentrated on B. Thus, 
we might as well discard all possible outcomes outside B and treat the conditional 
probabilities as a probability law defined on the new universe B. 
To verify the additivity axiom, we write for any two disjoint events $A_1$ and
$A_2$,

$$
\begin{aligned}
P(A_1 \cup A_2 \mid B) &= \frac{P\big((A_1 \cup A_2) \cap B\big)}{P(B)} \\
&= \frac{P\big((A_1 \cap B) \cup (A_2 \cap B)\big)}{P(B)} \\
&= \frac{P(A_1 \cap B) + P(A_2 \cap B)}{P(B)} \\
&= \frac{P(A_1 \cap B)}{P(B)} + \frac{P(A_2 \cap B)}{P(B)} \\
&= P(A_1 \mid B) + P(A_2 \mid B),
\end{aligned}
$$

where the second equality distributes the intersection over the union, and the
third equality uses the fact that $A_1 \cap B$ and $A_2 \cap B$ are disjoint
sets, together with the additivity axiom for the (unconditional) probability
law. The argument for a countable collection of disjoint sets is similar.

Since conditional probabilities constitute a legitimate probability law, all
general properties of probability laws remain valid. For example, a fact such as
$P(A \cup C) \le P(A) + P(C)$ translates to the new fact

$$P(A \cup C \mid B) \le P(A \mid B) + P(C \mid B).$$

Let us summarize the conclusions reached so far.


Properties of Conditional Probability 


• The conditional probability of an event A, given an event B with
P(B) > 0, is defined by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$

and specifies a new (conditional) probability law on the same sample
space $\Omega$. In particular, all known properties of probability laws remain
valid for conditional probability laws.

• Conditional probabilities can also be viewed as a probability law on a
new universe B, because all of the conditional probability is concentrated
on B.

• In the case where the possible outcomes are finitely many and equally
likely, we have

$$P(A \mid B) = \frac{\text{number of elements of } A \cap B}{\text{number of elements of } B}.$$
Example 1.6. We toss a fair coin three successive times. We wish to find the
conditional probability P(A | B) when A and B are the events

A = {more heads than tails come up},    B = {1st toss is a head}.

The sample space consists of eight sequences,

$$\Omega = \{HHH,\ HHT,\ HTH,\ HTT,\ THH,\ THT,\ TTH,\ TTT\},$$

which we assume to be equally likely. The event B consists of the four elements
HHH, HHT, HTH, HTT, so its probability is

$$P(B) = \frac{4}{8}.$$

The event $A \cap B$ consists of the three elements HHH, HHT, HTH, so
its probability is

$$P(A \cap B) = \frac{3}{8}.$$

Thus, the conditional probability P(A | B) is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{3/8}{4/8} = \frac{3}{4}.$$

Because all possible outcomes are equally likely here, we can also compute P(A | B)
using a shortcut. We can bypass the calculation of P(B) and $P(A \cap B)$, and simply
divide the number of elements shared by A and B (which is 3) by the number of
elements of B (which is 4), to obtain the same result 3/4.


Example 1.7. A fair 4-sided die is rolled twice and we assume that all sixteen
possible outcomes are equally likely. Let X and Y be the result of the 1st and the
2nd roll, respectively. We wish to determine the conditional probability P(A | B)
where

$$A = \{\max(X, Y) = m\}, \qquad B = \{\min(X, Y) = 2\},$$

and m takes each of the values 1, 2, 3, 4.

As in the preceding example, we can first determine the probabilities $P(A \cap B)$
and P(B) by counting the number of elements of $A \cap B$ and B, respectively, and
dividing by 16. Alternatively, we can directly divide the number of elements of
$A \cap B$ by the number of elements of B; see Fig. 1.7.
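
The counting shortcut of Example 1.7 can be checked by brute-force enumeration. The sketch below is an illustration only (the names `outcomes`, `b`, and `p_max_given_b` are assumptions introduced here); it lists the sixteen outcomes, keeps those in B, and counts how many of them have each value of the maximum.

```python
from fractions import Fraction
from itertools import product

# All sixteen equally likely outcomes (X, Y) of two rolls of a 4-sided die.
outcomes = list(product(range(1, 5), repeat=2))

# Conditioning event B = {min(X, Y) = 2}.
b = [w for w in outcomes if min(w) == 2]


def p_max_given_b(m):
    """P(max(X, Y) = m | B), by counting elements of A-and-B and of B."""
    shared = [w for w in b if max(w) == m]
    return Fraction(len(shared), len(b))


conditional_law = {m: p_max_given_b(m) for m in range(1, 5)}
for m, p in conditional_law.items():
    print(m, p)
```

The four values sum to 1, as they must for a conditional probability law on the new universe B.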


[Figure: grid of the outcomes of the two rolls, with axis label “1st Roll X”;
all outcomes equally likely, each with probability 1/16.]

Figure 1.7: Sample space of an experiment involving two rolls of a 4-sided die
(cf. Example 1.7). The conditioning event B = {min(X, Y) = 2} consists of the
5-element shaded set. The set A = {max(X, Y) = m} shares with B two elements
if m = 3 or m = 4, one element if m = 2, and no element if m = 1. Thus, we have

$$
P(\{\max(X, Y) = m\} \mid B) =
\begin{cases}
2/5 & \text{if } m = 3 \text{ or } m = 4, \\
1/5 & \text{if } m = 2, \\
0 & \text{if } m = 1.
\end{cases}
$$


Example 1.8. A conservative design team, call it C, and an innovative design
team, call it N, are asked to separately design a new product within a month. From
past experience we know that:

(a) The probability that team C is successful is 2/3.
(b) The probability that team N is successful is 1/2.
(c) The probability that at least one team is successful is 3/4.

If both teams are successful, the design of team N is adopted. Assuming that exactly
one successful design is produced, what is the probability that it was designed by
team N?

There are four possible outcomes here, corresponding to the four combinations
of success and failure of the two teams:

SS: both succeed,             FF: both fail,
SF: C succeeds, N fails,      FS: C fails, N succeeds.

We are given that the probabilities of these outcomes satisfy

$$P(SS) + P(SF) = \frac{2}{3}, \qquad P(SS) + P(FS) = \frac{1}{2}, \qquad P(SS) + P(SF) + P(FS) = \frac{3}{4}.$$

From these relations, together with the normalization equation P(SS) + P(SF) +
P(FS) + P(FF) = 1, we can obtain the probabilities of all the outcomes:

$$P(SS) = \frac{5}{12}, \qquad P(SF) = \frac{1}{4}, \qquad P(FS) = \frac{1}{12}, \qquad P(FF) = \frac{1}{4}.$$

The desired conditional probability is

$$P(\{FS\} \mid \{SF, FS\}) = \frac{\dfrac{1}{12}}{\dfrac{1}{4} + \dfrac{1}{12}} = \frac{1}{4}.$$
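
The small linear system of Example 1.8 can be solved mechanically. The following sketch (variable names are illustrative assumptions, not notation from the notes) uses exact fractions and inclusion-exclusion to recover the four outcome probabilities and the requested conditional probability.

```python
from fractions import Fraction

# Given data of Example 1.8.
p_c = Fraction(2, 3)             # P(SS) + P(SF): team C succeeds
p_n = Fraction(1, 2)             # P(SS) + P(FS): team N succeeds
p_at_least_one = Fraction(3, 4)  # P(SS) + P(SF) + P(FS)

# Solve for the individual outcomes.
p_ff = 1 - p_at_least_one          # normalization
p_ss = p_c + p_n - p_at_least_one  # inclusion-exclusion
p_sf = p_c - p_ss
p_fs = p_n - p_ss

# P({FS} | {SF, FS}): exactly one success, and it belongs to team N.
answer = p_fs / (p_sf + p_fs)
print(p_ss, p_sf, p_fs, p_ff, answer)  # 5/12 1/4 1/12 1/4 1/4
```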
Using Conditional Probability for Modeling


When constructing probabilistic models for experiments that have a sequential
character, it is often natural and convenient to first specify conditional
probabilities and then use them to determine unconditional probabilities. The rule
$P(A \cap B) = P(B)P(A \mid B)$, which is a restatement of the definition of conditional
probability, is often helpful in this process.


Example 1.9. Radar detection. If an aircraft is present in a certain area, a 
radar correctly registers its presence with probability 0.99. If it is not present, the 
radar falsely registers an aircraft presence with probability 0.10. We assume that 
an aircraft is present with probability 0.05. What is the probability of false alarm 
(a false indication of aircraft presence), and the probability of missed detection 
(nothing registers, even though an aircraft is present)? 

A sequential representation of the sample space is appropriate here, as shown 
in Fig. 1.8. Let A and B be the events 


A = {an aircraft is present}, 


B = {the radar registers an aircraft presence}, 
and consider also their complements 


$A^c$ = {an aircraft is not present},


$B^c$ = {the radar does not register an aircraft presence}.


The given probabilities are recorded along the corresponding branches of the tree 
describing the sample space, as shown in Fig. 1.8. Each event of interest corresponds 
to a leaf of the tree and its probability is equal to the product of the probabilities 
associated with the branches in a path from the root to the corresponding leaf. The 
desired probabilities of false alarm and missed detection are 


$$
\begin{aligned}
P(\text{false alarm}) &= P(A^c \cap B) = P(A^c)P(B \mid A^c) = 0.95 \cdot 0.10 = 0.095, \\
P(\text{missed detection}) &= P(A \cap B^c) = P(A)P(B^c \mid A) = 0.05 \cdot 0.01 = 0.0005.
\end{aligned}
$$


Extending the preceding example, we have a general rule for calculating 
various probabilities in conjunction with a tree-based sequential description of 
an experiment. In particular: 


(a) We set up the tree so that an event of interest is associated with a leaf. 
We view the occurrence of the event as a sequence of steps, namely, the 
traversals of the branches along the path from the root to the leaf. 


(b) We record the conditional probabilities associated with the branches of the 
tree. 


(c) We obtain the probability of a leaf by multiplying the probabilities recorded 
along the corresponding path of the tree. 
[Figure: tree diagram; one of the leaves is labeled “Missed Detection”.]

Figure 1.8: Sequential description of the sample space for the radar detection
problem in Example 1.9.


In mathematical terms, we are dealing with an event A which occurs if and
only if each one of several events $A_1, \ldots, A_n$ has occurred, i.e.,
$A = A_1 \cap A_2 \cap \cdots \cap A_n$. The occurrence of A is viewed as an
occurrence of $A_1$, followed by the occurrence of $A_2$, then of $A_3$, etc.,
and it is visualized as a path on the tree with n branches, corresponding to the
events $A_1, \ldots, A_n$. The probability of A is given by the following rule
(see also Fig. 1.9).


Multiplication Rule 


Assuming that all of the conditioning events have positive probability, we
have

$$P\Big(\bigcap_{i=1}^{n} A_i\Big) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P\Big(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big).$$


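The multiplication rule can be verified by writing

$$P\Big(\bigcap_{i=1}^{n} A_i\Big) = P(A_1) \cdot \frac{P(A_1 \cap A_2)}{P(A_1)} \cdot \frac{P(A_1 \cap A_2 \cap A_3)}{P(A_1 \cap A_2)} \cdots \frac{P\big(\bigcap_{i=1}^{n} A_i\big)}{P\big(\bigcap_{i=1}^{n-1} A_i\big)},$$

and by using the definition of conditional probability to rewrite the right-hand
side above as

$$P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P\Big(A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big).$$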
[Figure: path on a tree with nodes labeled “Event $A_1 \cap A_2 \cap A_3$” and
“Event $A_1 \cap A_2 \cap \cdots \cap A_n$”; the last branch carries the
conditional probability $P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1})$.]

Figure 1.9: Visualization of the multiplication rule. The intersection event
$A = A_1 \cap A_2 \cap \cdots \cap A_n$ is associated with a path on the tree of a
sequential description of the experiment. We associate the branches of this path
with the events $A_1, \ldots, A_n$, and we record next to the branches the
corresponding conditional probabilities.

The final node of the path corresponds to the intersection event A, and
its probability is obtained by multiplying the conditional probabilities recorded
along the branches of the path:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}).$$

Note that any intermediate node along the path also corresponds to some
intersection event and its probability is obtained by multiplying the corresponding
conditional probabilities up to that node. For example, the event
$A_1 \cap A_2 \cap A_3$ corresponds to the node shown in the figure, and its
probability is

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$


For the case of just two events, $A_1$ and $A_2$, the multiplication rule is simply the
definition of conditional probability.


Example 1.10. Three cards are drawn from an ordinary 52-card deck without 
replacement (drawn cards are not placed back in the deck). We wish to find the 
probability that none of the three cards is a heart. We assume that at each step, 
each one of the remaining cards is equally likely to be picked. By symmetry, this 
implies that every triplet of cards is equally likely to be drawn. A cumbersome 
approach, which we will not use, is to count the number of all card triplets that
do not include a heart, and divide it by the number of all possible card triplets.
Instead, we use a sequential description of the sample space in conjunction with the 
multiplication rule (cf. Fig. 1.10). 
Define the events

$$A_i = \{\text{the } i\text{th card is not a heart}\}, \qquad i = 1, 2, 3.$$

We will calculate $P(A_1 \cap A_2 \cap A_3)$, the probability that none of the three
cards is a heart, using the multiplication rule,

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$

We have

$$P(A_1) = \frac{39}{52},$$

since there are 39 cards that are not hearts in the 52-card deck. Given that the
first card is not a heart, we are left with 51 cards, 38 of which are not hearts, and

$$P(A_2 \mid A_1) = \frac{38}{51}.$$

Finally, given that the first two cards drawn are not hearts, there are 37 cards which
are not hearts in the remaining 50-card deck, and

$$P(A_3 \mid A_1 \cap A_2) = \frac{37}{50}.$$

These probabilities are recorded along the corresponding branches of the tree
describing the sample space, as shown in Fig. 1.10. The desired probability is now
obtained by multiplying the probabilities recorded along the corresponding path of
the tree:

$$P(A_1 \cap A_2 \cap A_3) = \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{37}{50}.$$

Note that once the probabilities are recorded along the tree, the probability
of several other events can be similarly calculated. For example,

$$
\begin{aligned}
P(\text{1st is not a heart and 2nd is a heart}) &= \frac{39}{52} \cdot \frac{13}{51}, \\
P(\text{1st two are not hearts and 3rd is a heart}) &= \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{13}{50}.
\end{aligned}
$$


[Figure: tree diagram whose branches are labeled “Not a Heart”; the first such
branch carries the probability 39/52.]

Figure 1.10: Sequential description of the sample space of the 3-card selection
problem in Example 1.10.
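
As a sanity check on Example 1.10, the sequential (multiplication rule) calculation can be compared with the "cumbersome" counting approach mentioned in the example. The sketch below is illustrative; the function names and default arguments are assumptions introduced here.

```python
from fractions import Fraction
from math import comb


def no_heart_sequential(draws, deck=52, hearts=13):
    """Multiply the conditional probabilities along the 'not a heart' path."""
    non_hearts = deck - hearts
    p = Fraction(1)
    for k in range(draws):
        # Given k non-hearts already drawn: (non_hearts - k) of (deck - k) remain.
        p *= Fraction(non_hearts - k, deck - k)
    return p


def no_heart_counting(draws, deck=52, hearts=13):
    """Count heart-free card subsets and divide by the number of all subsets."""
    return Fraction(comb(deck - hearts, draws), comb(deck, draws))


p_seq = no_heart_sequential(3)
p_count = no_heart_counting(3)
print(p_seq, p_seq == p_count)  # 703/1700 True
```

Both routes agree, as expected, since every triplet of cards is equally likely to be drawn.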
Example 1.11. A class consisting of 4 graduate and 12 undergraduate students
is randomly divided into 4 groups of 4. What is the probability that each group 
includes a graduate student? We interpret randomly to mean that given the as- 
signment of some students to certain slots, any of the remaining students is equally 
likely to be assigned to any of the remaining slots. We then calculate the desired 
probability using the multiplication rule, based on the sequential description shown 
in Fig. 1.11. Let us denote the four graduate students by 1, 2, 3, 4, and consider 
the events 


$A_1$ = {students 1 and 2 are in different groups},
$A_2$ = {students 1, 2, and 3 are in different groups},
$A_3$ = {students 1, 2, 3, and 4 are in different groups}.


We will calculate $P(A_3)$ using the multiplication rule:

$$P(A_3) = P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2).$$

We have

$$P(A_1) = \frac{12}{15},$$

since there are 12 student slots in groups other than the one of student 1, and there
are 15 student slots overall, excluding student 1. Similarly,

$$P(A_2 \mid A_1) = \frac{8}{14},$$

since there are 8 student slots in groups other than the one of students 1 and 2,
and there are 14 student slots, excluding students 1 and 2. Also,

$$P(A_3 \mid A_1 \cap A_2) = \frac{4}{13},$$

since there are 4 student slots in groups other than the one of students 1, 2, and
3, and there are 13 student slots, excluding students 1, 2, and 3. Thus, the desired
probability is

$$\frac{12}{15} \cdot \frac{8}{14} \cdot \frac{4}{13},$$

and is obtained by multiplying the conditional probabilities along the corresponding
path of the tree of Fig. 1.11.


1.4 TOTAL PROBABILITY THEOREM AND BAYES’ RULE 


In this section, we explore some applications of conditional probability. We start 
with the following theorem, which is often useful for computing the probabilities 
of various events, using a “divide-and-conquer” approach. 
[Figure: tree path with branches “Students 1 & 2 are in Different Groups” (12/15),
“Students 1, 2, & 3 are in Different Groups” (8/14), and “Students 1, 2, 3, & 4
are in Different Groups” (4/13).]

Figure 1.11: Sequential description of the sample space of the student problem
in Example 1.11.


Total Probability Theorem 


Let $A_1, \ldots, A_n$ be disjoint events that form a partition of the sample space
(each possible outcome is included in one and only one of the events $A_1, \ldots, A_n$)
and assume that $P(A_i) > 0$, for all $i = 1, \ldots, n$. Then, for any event B, we
have

$$
\begin{aligned}
P(B) &= P(A_1 \cap B) + \cdots + P(A_n \cap B) \\
&= P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n).
\end{aligned}
$$


The theorem is visualized and proved in Fig. 1.12. Intuitively, we are
partitioning the sample space into a number of scenarios (events) $A_i$. Then, the
probability that B occurs is a weighted average of its conditional probability
under each scenario, where each scenario is weighted according to its
(unconditional) probability. One of the uses of the theorem is to compute the
probability of various events B for which the conditional probabilities
$P(B \mid A_i)$ are known or easy to derive. The key is to choose appropriately the
partition $A_1, \ldots, A_n$, and this choice is often suggested by the problem
structure. Here are some examples.


[Figure: partition of the sample space into $A_1$, $A_2$, $A_3$ together with the
event B (left), and an equivalent sequential tree with leaves $A_1 \cap B$,
$A_2 \cap B$, $A_3 \cap B$ (right).]

Figure 1.12: Visualization and verification of the total probability theorem. The
events $A_1, \ldots, A_n$ form a partition of the sample space, so the event B can be
decomposed into the disjoint union of its intersections $A_i \cap B$ with the sets
$A_i$, i.e.,

$$B = (A_1 \cap B) \cup \cdots \cup (A_n \cap B).$$

Using the additivity axiom, it follows that

$$P(B) = P(A_1 \cap B) + \cdots + P(A_n \cap B).$$

Since, by the definition of conditional probability, we have

$$P(A_i \cap B) = P(A_i)P(B \mid A_i),$$

the preceding equality yields

$$P(B) = P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n).$$

For an alternative view, consider an equivalent sequential model, as shown
on the right. The probability of the leaf $A_i \cap B$ is the product
$P(A_i)P(B \mid A_i)$ of the probabilities along the path leading to that leaf. The
event B consists of the three highlighted leaves and P(B) is obtained by adding
their probabilities.


Example 1.12. You enter a chess tournament where your probability of winning
a game is 0.3 against half the players (call them type 1), 0.4 against a quarter of
the players (call them type 2), and 0.5 against the remaining quarter of the players
(call them type 3). You play a game against a randomly chosen opponent. What
is the probability of winning?

Let $A_i$ be the event of playing with an opponent of type i. We have

$$P(A_1) = 0.5, \qquad P(A_2) = 0.25, \qquad P(A_3) = 0.25.$$

Let also B be the event of winning. We have

$$P(B \mid A_1) = 0.3, \qquad P(B \mid A_2) = 0.4, \qquad P(B \mid A_3) = 0.5.$$

Thus, by the total probability theorem, the probability of winning is

$$
\begin{aligned}
P(B) &= P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3) \\
&= 0.5 \cdot 0.3 + 0.25 \cdot 0.4 + 0.25 \cdot 0.5 \\
&= 0.375.
\end{aligned}
$$
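
The weighted-average reading of the total probability theorem translates directly into code. The following is a minimal sketch under the assumption that the scenarios are given as two parallel lists; the function name `total_probability` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(B) = sum over i of P(A_i) * P(B | A_i), for a partition A_1, ..., A_n."""
    if len(priors) != len(likelihoods):
        raise ValueError("need one conditional probability per scenario")
    if sum(priors) != 1:
        raise ValueError("the scenario probabilities must sum to 1")
    return sum(p * q for p, q in zip(priors, likelihoods))


# Chess tournament of Example 1.12, with exact fractions.
priors = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]           # P(A_i)
win_given_type = [Fraction(3, 10), Fraction(2, 5), Fraction(1, 2)]  # P(B | A_i)

p_win = total_probability(priors, win_given_type)
print(p_win, float(p_win))  # 3/8 0.375
```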


Example 1.13. We roll a fair four-sided die. If the result is 1 or 2, we roll once
more but otherwise, we stop. What is the probability that the sum total of our
rolls is at least 4?

Let $A_i$ be the event that the result of the first roll is i, and note that
$P(A_i) = 1/4$ for each i. Let B be the event that the sum total is at least 4.
Given the event $A_1$, the sum total will be at least 4 if the second roll results
in 3 or 4, which happens with probability 1/2. Similarly, given the event $A_2$,
the sum total will be at least 4 if the second roll results in 2, 3, or 4, which
happens with probability 3/4. Also, given the event $A_3$, we stop and the sum
total remains below 4, while given the event $A_4$, we stop with a sum total
equal to 4. Therefore,

$$P(B \mid A_1) = \frac{1}{2}, \qquad P(B \mid A_2) = \frac{3}{4}, \qquad P(B \mid A_3) = 0, \qquad P(B \mid A_4) = 1.$$

By the total probability theorem,

$$P(B) = \frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{3}{4} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 1 = \frac{9}{16}.$$

The total probability theorem can be applied repeatedly to calculate
probabilities in experiments that have a sequential character, as shown in the
following example.


Example 1.14. Alice is taking a probability class and at the end of each week
she can be either up-to-date or she may have fallen behind. If she is up-to-date in
a given week, the probability that she will be up-to-date (or behind) in the next
week is 0.8 (or 0.2, respectively). If she is behind in a given week, the probability
that she will be up-to-date (or behind) in the next week is 0.4 (or 0.6, respectively).
Alice is (by default) up-to-date when she starts the class. What is the probability
that she is up-to-date after three weeks?

Let $U_i$ and $B_i$ be the events that Alice is up-to-date or behind, respectively,
after i weeks. According to the total probability theorem, the desired probability
$P(U_3)$ is given by

$$P(U_3) = P(U_2)P(U_3 \mid U_2) + P(B_2)P(U_3 \mid B_2) = P(U_2) \cdot 0.8 + P(B_2) \cdot 0.4.$$

The probabilities $P(U_2)$ and $P(B_2)$ can also be calculated using the total
probability theorem:

$$
\begin{aligned}
P(U_2) &= P(U_1)P(U_2 \mid U_1) + P(B_1)P(U_2 \mid B_1) = P(U_1) \cdot 0.8 + P(B_1) \cdot 0.4, \\
P(B_2) &= P(U_1)P(B_2 \mid U_1) + P(B_1)P(B_2 \mid B_1) = P(U_1) \cdot 0.2 + P(B_1) \cdot 0.6.
\end{aligned}
$$

Finally, since Alice starts her class up-to-date, we have

$$P(U_1) = 0.8, \qquad P(B_1) = 0.2.$$

We can now combine the preceding three equations to obtain

$$
\begin{aligned}
P(U_2) &= 0.8 \cdot 0.8 + 0.2 \cdot 0.4 = 0.72, \\
P(B_2) &= 0.8 \cdot 0.2 + 0.2 \cdot 0.6 = 0.28,
\end{aligned}
$$

and by using the above probabilities in the formula for $P(U_3)$:

$$P(U_3) = 0.72 \cdot 0.8 + 0.28 \cdot 0.4 = 0.688.$$


Note that we could have calculated the desired probability $P(U_3)$ by
constructing a tree description of the experiment, by calculating the probability of
every element of $U_3$ using the multiplication rule on the tree, and by adding. In
experiments with a sequential character one may often choose between using the
multiplication rule or the total probability theorem for calculation of various
probabilities. However, there are cases where the calculation based on the total
probability theorem is more convenient. For example, suppose we are interested in
the probability $P(U_{20})$ that Alice is up-to-date after 20 weeks. Calculating this
probability using the multiplication rule is very cumbersome, because the tree
representing the experiment is 20 stages deep and has $2^{20}$ leaves. On the other
hand, with a computer, a sequential calculation using the total probability formulas

$$
\begin{aligned}
P(U_{i+1}) &= P(U_i) \cdot 0.8 + P(B_i) \cdot 0.4, \\
P(B_{i+1}) &= P(U_i) \cdot 0.2 + P(B_i) \cdot 0.6,
\end{aligned}
$$

and the initial conditions $P(U_1) = 0.8$, $P(B_1) = 0.2$ is very simple.
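
The sequential calculation described above takes only a few lines. The sketch below is one possible implementation, not the authors' code; the function name and the parameter names `stay_up` (for $P(U_{i+1} \mid U_i)$) and `recover` (for $P(U_{i+1} \mid B_i)$) are assumptions made for illustration.

```python
from fractions import Fraction


def up_to_date_probabilities(weeks, stay_up=Fraction(4, 5), recover=Fraction(2, 5)):
    """Return [P(U_1), ..., P(U_weeks)] for a student who starts up-to-date."""
    p_u, p_b = stay_up, 1 - stay_up  # initial conditions P(U_1), P(B_1)
    history = [p_u]
    for _ in range(weeks - 1):
        # Total probability theorem, conditioning on the previous week's state.
        p_u, p_b = (p_u * stay_up + p_b * recover,
                    p_u * (1 - stay_up) + p_b * (1 - recover))
        history.append(p_u)
    return history


probs = up_to_date_probabilities(20)
print(probs[2])          # 86/125, i.e. P(U_3) = 0.688
print(float(probs[-1]))  # P(U_20), obtained after 19 update steps
```

The loop performs one update per week, so the work grows linearly with the number of weeks, in contrast with the $2^{20}$ leaves of the tree.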


The total probability theorem is often used in conjunction with the fol- 
lowing celebrated theorem, which relates conditional probabilities of the form 
P(A| B) with conditional probabilities of the form P(B| A), in which the order 
of the conditioning is reversed. 


Bayes’ Rule 


Let $A_1, A_2, \ldots, A_n$ be disjoint events that form a partition of the sample
space, and assume that $P(A_i) > 0$, for all i. Then, for any event B such
that P(B) > 0, we have

$$
\begin{aligned}
P(A_i \mid B) &= \frac{P(A_i)P(B \mid A_i)}{P(B)} \\
&= \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n)}.
\end{aligned}
$$


To verify Bayes’ rule, note that $P(A_i)P(B \mid A_i)$ and $P(A_i \mid B)P(B)$ are
equal, because they are both equal to $P(A_i \cap B)$. This yields the first equality.
The second equality follows from the first by using the total probability theorem
to rewrite P(B).

Bayes’ rule is often used for inference. There are a number of “causes”
that may result in a certain “effect.” We observe the effect, and we wish to infer
the cause. The events $A_1, \ldots, A_n$ are associated with the causes and the event B
represents the effect. The probability $P(B \mid A_i)$ that the effect will be observed
when the cause $A_i$ is present amounts to a probabilistic model of the cause-effect
relation (cf. Fig. 1.13). Given that the effect B has been observed, we wish to
evaluate the (conditional) probability $P(A_i \mid B)$ that the cause $A_i$ is present.


[Figure: three causes (Cause 1: Malignant Tumor, Cause 2: Nonmalignant Tumor,
Cause 3: Other) and the effect B, with the intersections $A_1 \cap B$,
$A_2 \cap B$, $A_3 \cap B$; an equivalent sequential tree is shown on the right.]

Figure 1.13: An example of the inference context that is implicit in Bayes’
rule. We observe a shade in a person’s X-ray (this is event B, the “effect”) and
we want to estimate the likelihood of three mutually exclusive and collectively
exhaustive potential causes: cause 1 (event $A_1$) is that there is a malignant tumor,
cause 2 (event $A_2$) is that there is a nonmalignant tumor, and cause 3 (event
$A_3$) corresponds to reasons other than a tumor. We assume that we know the
probabilities $P(A_i)$ and $P(B \mid A_i)$, i = 1, 2, 3. Given that we see a shade (event
B occurs), Bayes’ rule gives the conditional probabilities of the various causes as

$$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3)}, \qquad i = 1, 2, 3.$$

For an alternative view, consider an equivalent sequential model, as shown
on the right. The probability $P(A_1 \mid B)$ of a malignant tumor is the probability
of the first highlighted leaf, which is $P(A_1 \cap B)$, divided by the total probability
of the highlighted leaves, which is P(B).


Example 1.15. Let us return to the radar detection problem of Example 1.9 and 
Fig. 1.8. Let 


A ={an aircraft is present}, 


B ={the radar registers an aircraft presence}. 
We are given that 


$$P(A) = 0.05, \qquad P(B \mid A) = 0.99, \qquad P(B \mid A^c) = 0.1.$$

Applying Bayes’ rule, with $A_1 = A$ and $A_2 = A^c$, we obtain

$$
\begin{aligned}
P(\text{aircraft present} \mid \text{radar registers}) &= P(A \mid B) \\
&= \frac{P(A)P(B \mid A)}{P(B)} \\
&= \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)} \\
&= \frac{0.05 \cdot 0.99}{0.05 \cdot 0.99 + 0.95 \cdot 0.1} \\
&\approx 0.3426.
\end{aligned}
$$
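
Bayes' rule combines a numerator $P(A_i)P(B \mid A_i)$ with the total probability theorem in the denominator, which suggests the following small sketch. The function name `bayes_posterior` is hypothetical; the inputs are assumed to be parallel lists describing a partition $A_1, \ldots, A_n$.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return [P(A_1 | B), ..., P(A_n | B)] from P(A_i) and P(B | A_i)."""
    joint = [p * q for p, q in zip(priors, likelihoods)]  # P(A_i and B)
    p_b = sum(joint)                                      # total probability theorem
    if p_b == 0:
        raise ValueError("P(B) = 0: the posterior is undefined")
    return [j / p_b for j in joint]


# Radar problem of Example 1.15: the partition is {A, A^c}.
radar = bayes_posterior([0.05, 0.95], [0.99, 0.10])
print(round(radar[0], 4))  # 0.3426

# Chess problem of Example 1.12, with exact fractions.
chess = bayes_posterior(
    [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
    [Fraction(3, 10), Fraction(2, 5), Fraction(1, 2)],
)
print(chess[0])  # 2/5
```

Even though the radar detects a present aircraft with probability 0.99, the posterior is only about 0.34, because aircraft are rare (prior 0.05) and false alarms are comparatively frequent.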


Example 1.16. Let us return to the chess problem of Example 1.12. Here $A_i$ is
the event of getting an opponent of type i, and

$$P(A_1) = 0.5, \qquad P(A_2) = 0.25, \qquad P(A_3) = 0.25.$$

Also, B is the event of winning, and

$$P(B \mid A_1) = 0.3, \qquad P(B \mid A_2) = 0.4, \qquad P(B \mid A_3) = 0.5.$$

Suppose that you win. What is the probability $P(A_1 \mid B)$ that you had an opponent
of type 1?
Using Bayes’ rule, we have

$$
\begin{aligned}
P(A_1 \mid B) &= \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3)} \\
&= \frac{0.5 \cdot 0.3}{0.5 \cdot 0.3 + 0.25 \cdot 0.4 + 0.25 \cdot 0.5} \\
&= 0.4.
\end{aligned}
$$


1.5 INDEPENDENCE 


We have introduced the conditional probability P(A|B) to capture the partial 
information that event B provides about event A. An interesting and important 
special case arises when the occurrence of B provides no information and does 
not alter the probability that A has occurred, i.e., 


P(A|B) = P(A). 
When the above equality holds, we say that A is independent of B. Note that
by the definition $P(A \mid B) = P(A \cap B)/P(B)$, this is equivalent to

$$P(A \cap B) = P(A)P(B).$$


We adopt this latter relation as the definition of independence because it can be 
used even if P(B) = 0, in which case P(A| B) is undefined. The symmetry of 
this relation also implies that independence is a symmetric property; that is, if 
A is independent of B, then B is independent of A, and we can unambiguously 
say that A and B are independent events. 

Independence is often easy to grasp intuitively. For example, if the occur- 
rence of two events is governed by distinct and noninteracting physical processes, 
such events will turn out to be independent. On the other hand, independence 
is not easily visualized in terms of the sample space. A common first thought 
is that two events are independent if they are disjoint, but in fact the opposite 
is true: two disjoint events A and B with P(A) > 0 and P(B) > 0 are never 
independent, since their intersection $A \cap B$ is empty and has probability 0,
whereas $P(A)P(B) > 0$.


Example 1.17. Consider an experiment involving two successive rolls of a 4-sided 
die in which all 16 possible outcomes are equally likely and have probability 1/16. 


(a) Are the events

$$A_i = \{\text{1st roll results in } i\}, \qquad B_j = \{\text{2nd roll results in } j\},$$

independent? We have

$$
\begin{aligned}
P(A_i \cap B_j) &= P(\text{the result of the two rolls is } (i, j)) = \frac{1}{16}, \\
P(A_i) &= \frac{\text{number of elements of } A_i}{\text{total number of possible outcomes}} = \frac{4}{16}, \\
P(B_j) &= \frac{\text{number of elements of } B_j}{\text{total number of possible outcomes}} = \frac{4}{16}.
\end{aligned}
$$

We observe that $P(A_i \cap B_j) = P(A_i)P(B_j)$, and the independence of $A_i$ and
$B_j$ is verified. Thus, our choice of the discrete uniform probability law (which
might have seemed arbitrary) models the independence of the two rolls.


(b) Are the events

A = {1st roll is a 1},    B = {sum of the two rolls is a 5},

independent? The answer here is not quite obvious. We have

$$P(A \cap B) = P(\text{the result of the two rolls is } (1, 4)) = \frac{1}{16},$$

and also

$$P(A) = \frac{\text{number of elements of } A}{\text{total number of possible outcomes}} = \frac{4}{16}.$$

The event B consists of the outcomes (1,4), (2,3), (3,2), and (4,1), and

$$P(B) = \frac{\text{number of elements of } B}{\text{total number of possible outcomes}} = \frac{4}{16}.$$

Thus, we see that $P(A \cap B) = P(A)P(B)$, and the events A and B are
independent.


(c) Are the events

A = {maximum of the two rolls is 2},    B = {minimum of the two rolls is 2},

independent? Intuitively, the answer is “no” because the minimum of the two
rolls tells us something about the maximum. For example, if the minimum is
2, the maximum cannot be 1. More precisely, to verify that A and B are not
independent, we calculate

$$P(A \cap B) = P(\text{the result of the two rolls is } (2, 2)) = \frac{1}{16},$$

and also

$$
\begin{aligned}
P(A) &= \frac{\text{number of elements of } A}{\text{total number of possible outcomes}} = \frac{3}{16}, \\
P(B) &= \frac{\text{number of elements of } B}{\text{total number of possible outcomes}} = \frac{5}{16}.
\end{aligned}
$$

We have $P(A)P(B) = 15/(16)^2$, so that $P(A \cap B) \ne P(A)P(B)$, and A and
B are not independent.
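
Parts (b) and (c) of Example 1.17 can be verified by enumerating the sixteen outcomes and testing the product rule $P(A \cap B) = P(A)P(B)$ directly. The sketch below is illustrative; events are represented as predicates on an outcome `(x, y)`, and all function names are assumptions introduced here.

```python
from fractions import Fraction
from itertools import product

# Sixteen equally likely outcomes of two rolls of a 4-sided die.
omega = list(product(range(1, 5), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def independent(a, b):
    """Check the defining relation P(A and B) = P(A) P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


def first_is_1(w):
    return w[0] == 1


def sum_is_5(w):
    return w[0] + w[1] == 5


def max_is_2(w):
    return max(w) == 2


def min_is_2(w):
    return min(w) == 2


print(independent(first_is_1, sum_is_5))  # True  (part b)
print(independent(max_is_2, min_is_2))    # False (part c)
```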


Conditional Independence 


We noted earlier that the conditional probabilities of events, conditioned on 
a particular event, form a legitimate probability law. We can thus talk about 
independence of various events with respect to this conditional law. In particular, 
given an event C, the events A and B are called conditionally independent
if

$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C).$$

The definition of the conditional probability and the multiplication rule yield

$$
\begin{aligned}
P(A \cap B \mid C) &= \frac{P(A \cap B \cap C)}{P(C)} \\
&= \frac{P(C)P(B \mid C)P(A \mid B \cap C)}{P(C)} \\
&= P(B \mid C)P(A \mid B \cap C).
\end{aligned}
$$

Comparing this expression with the defining relation and canceling the factor
$P(B \mid C)$, assumed nonzero, we see that conditional independence is the same
as the condition

$$P(A \mid B \cap C) = P(A \mid C).$$


In words, this relation states that if C is known to have occurred, the additional 
knowledge that B also occurred does not change the probability of A. 

Interestingly, independence of two events A and B with respect to the 
unconditional probability law, does not imply conditional independence, and 
vice versa, as illustrated by the next two examples. 


Example 1.18. Consider two independent fair coin tosses, in which all four 
possible outcomes are equally likely. Let 


$H_1$ = {1st toss is a head},
$H_2$ = {2nd toss is a head},
D = {the two tosses have different results}.


The events $H_1$ and $H_2$ are (unconditionally) independent. But

$$P(H_1 \mid D) = \frac{1}{2}, \qquad P(H_2 \mid D) = \frac{1}{2}, \qquad P(H_1 \cap H_2 \mid D) = 0,$$

so that $P(H_1 \cap H_2 \mid D) \ne P(H_1 \mid D)P(H_2 \mid D)$, and $H_1$, $H_2$ are not
conditionally independent.


Example 1.19. There are two coins, a blue and a red one. We choose one of 
the two at random, each being chosen with probability 1/2, and proceed with two 
independent tosses. The coins are biased: with the blue coin, the probability of 
heads in any given toss is 0.99, whereas for the red coin it is 0.01. 

Let B be the event that the blue coin was selected. Let also $H_i$ be the event
that the ith toss resulted in heads. Given the choice of a coin, the events $H_1$ and
$H_2$ are independent, because of our assumption of independent tosses. Thus,

$$P(H_1 \cap H_2 \mid B) = P(H_1 \mid B)P(H_2 \mid B) = 0.99 \cdot 0.99.$$

On the other hand, the events $H_1$ and $H_2$ are not independent. Intuitively, if we
are told that the first toss resulted in heads, this leads us to suspect that the blue
coin was selected, in which case, we expect the second toss to also result in heads.
Mathematically, we use the total probability theorem to obtain

$$P(H_1) = P(B)P(H_1 \mid B) + P(B^c)P(H_1 \mid B^c) = \frac{1}{2} \cdot 0.99 + \frac{1}{2} \cdot 0.01 = \frac{1}{2},$$

as should be expected from symmetry considerations. Similarly, we have
$P(H_2) = 1/2$. Now notice that

$$
\begin{aligned}
P(H_1 \cap H_2) &= P(B)P(H_1 \cap H_2 \mid B) + P(B^c)P(H_1 \cap H_2 \mid B^c) \\
&= \frac{1}{2} \cdot 0.99 \cdot 0.99 + \frac{1}{2} \cdot 0.01 \cdot 0.01 \approx \frac{1}{2}.
\end{aligned}
$$

Thus, $P(H_1 \cap H_2) \ne P(H_1)P(H_2) = 1/4$, and the events $H_1$ and $H_2$ are
dependent, even though they are conditionally independent given B.
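
The numbers in Example 1.19 can be reproduced exactly with a few lines. This is an illustrative sketch; the dictionaries `p_coin` and `p_heads` are assumed encodings of the model, and the tosses are taken to be conditionally independent given the coin, as in the example.

```python
from fractions import Fraction

p_coin = {"blue": Fraction(1, 2), "red": Fraction(1, 2)}        # coin choice
p_heads = {"blue": Fraction(99, 100), "red": Fraction(1, 100)}  # P(heads | coin)

# Total probability theorem, conditioning on the chosen coin.
p_h1 = sum(p_coin[c] * p_heads[c] for c in p_coin)
# Given the coin, the two tosses are independent, so P(H1 and H2 | coin) = p^2.
p_h1_and_h2 = sum(p_coin[c] * p_heads[c] ** 2 for c in p_coin)

print(p_h1, p_h1_and_h2, p_h1 * p_h1)  # 1/2 4901/10000 1/4
```

The joint probability 0.4901 is far from the product 0.25, which quantifies how strongly the first toss informs us about the second through the unknown coin.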


As mentioned earlier, if A and B are independent, the occurrence of B does 
not provide any new information on the probability of A occurring. It is then 
intuitive that the non-occurrence of B should also provide no information on the 
probability of A. Indeed, it can be verified that if A and B are independent, the
same holds true for A and $B^c$ (see the theoretical problems).

We now summarize. 


Independence 


• Two events A and B are said to be independent if

$$P(A \cap B) = P(A)P(B).$$

If in addition, P(B) > 0, independence is equivalent to the condition

$$P(A \mid B) = P(A).$$

• If A and B are independent, so are A and $B^c$.

• Two events A and B are said to be conditionally independent, given
another event C with P(C) > 0, if

$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C).$$

If in addition, $P(B \cap C) > 0$, conditional independence is equivalent
to the condition

$$P(A \mid B \cap C) = P(A \mid C).$$

• Independence does not imply conditional independence, and vice versa.


Independence of a Collection of Events 


The definition of independence can be extended to multiple events.


Definition of Independence of Several Events


We say that the events $A_1, A_2, \ldots, A_n$ are independent if

$$P\Big(\bigcap_{i \in S} A_i\Big) = \prod_{i \in S} P(A_i), \qquad \text{for every subset } S \text{ of } \{1, 2, \ldots, n\}.$$


If we have a collection of three events, $A_1$, $A_2$, and $A_3$, independence
amounts to satisfying the four conditions

$$\begin{aligned}
P(A_1 \cap A_2) &= P(A_1)\,P(A_2),\\
P(A_1 \cap A_3) &= P(A_1)\,P(A_3),\\
P(A_2 \cap A_3) &= P(A_2)\,P(A_3),\\
P(A_1 \cap A_2 \cap A_3) &= P(A_1)\,P(A_2)\,P(A_3).
\end{aligned}$$


The first three conditions simply assert that any two events are independent,
a property known as pairwise independence. But the fourth condition is
also important and does not follow from the first three. Conversely, the fourth
condition does not imply the first three; see the two examples that follow.


Example 1.20. Pairwise independence does not imply independence.
Consider two independent fair coin tosses, and the following events:

$H_1$ = {1st toss is a head},
$H_2$ = {2nd toss is a head},
D = {the two tosses have different results}.

The events $H_1$ and $H_2$ are independent, by definition. To see that $H_1$ and D are
independent, we note that

$$P(D \mid H_1) = \frac{P(H_1 \cap D)}{P(H_1)} = \frac{1/4}{1/2} = \frac{1}{2} = P(D).$$


Similarly, $H_2$ and D are independent. On the other hand, we have

$$P(H_1 \cap H_2 \cap D) = 0 \neq \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = P(H_1)P(H_2)P(D),$$


and these three events are not independent.
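
The product conditions in this example can be verified by enumerating the sample space. The sketch below is one possible way to do so in Python; the helper names `prob` and `product_rule_holds` are ours and are used only for illustration.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

# Sample space: two fair coin tosses, four equally likely outcomes.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform law."""
    return Fraction(len(event), len(omega))

h1 = {w for w in omega if w[0] == "H"}   # 1st toss is a head
h2 = {w for w in omega if w[1] == "H"}   # 2nd toss is a head
d = {w for w in omega if w[0] != w[1]}   # the two tosses differ
events = [h1, h2, d]

def product_rule_holds(subset):
    """True if P(intersection) equals the product of the probabilities."""
    return prob(set.intersection(*subset)) == prod(prob(e) for e in subset)

pairwise = all(product_rule_holds(pair) for pair in combinations(events, 2))
triple = product_rule_holds(events)
print(pairwise, triple)
```

The first flag reports whether every pair satisfies the product rule, and the second whether the triple does; independence of the collection requires both.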


Example 1.21. The equality $P(A_1 \cap A_2 \cap A_3) = P(A_1)\,P(A_2)\,P(A_3)$ is not
enough for independence. Consider two independent rolls of a fair die, and
the following events:


A = {1st roll is 1, 2, or 3},
B = {1st roll is 3, 4, or 5},
C = {the sum of the two rolls is 9}.


We have

$$\begin{aligned}
P(A \cap B) &= \frac{1}{6} \neq \frac{1}{2} \cdot \frac{1}{2} = P(A)P(B),\\
P(A \cap C) &= \frac{1}{36} \neq \frac{1}{2} \cdot \frac{4}{36} = P(A)P(C),\\
P(B \cap C) &= \frac{1}{12} \neq \frac{1}{2} \cdot \frac{4}{36} = P(B)P(C).
\end{aligned}$$


Thus the three events A, B, and C are not independent, and indeed no two of these
events are independent. On the other hand, we have

$$P(A \cap B \cap C) = \frac{1}{36} = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{4}{36} = P(A)P(B)P(C).$$
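
As a sanity check, the probabilities in this example can be recomputed by listing the 36 outcomes of the two rolls. This is a minimal Python sketch; the names `omega` and `prob` are ours.

```python
from fractions import Fraction
from itertools import product

# Sample space: two rolls of a fair die, 36 equally likely outcomes.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform law."""
    return Fraction(len(event), len(omega))

a = {w for w in omega if w[0] in (1, 2, 3)}
b = {w for w in omega if w[0] in (3, 4, 5)}
c = {w for w in omega if sum(w) == 9}

# The triple product condition holds ...
triple_ok = prob(a & b & c) == prob(a) * prob(b) * prob(c)
# ... but none of the pairwise conditions does.
pairs_ok = [
    prob(a & b) == prob(a) * prob(b),
    prob(a & c) == prob(a) * prob(c),
    prob(b & c) == prob(b) * prob(c),
]
print(triple_ok, pairs_ok)
```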


The intuition behind the independence of a collection of events is anal- 
ogous to the case of two events. Independence means that the occurrence or 
non-occurrence of any number of the events from that collection carries no 
information on the remaining events or their complements. For example, if the 
events $A_1, A_2, A_3, A_4$ are independent, one obtains relations such as


$$P(A_1 \cup A_2 \mid A_3 \cap A_4) = P(A_1 \cup A_2)$$


or

$$P(A_1 \cup A_2^c \mid A_3^c \cap A_4) = P(A_1 \cup A_2^c);$$


see the theoretical problems.


Reliability


In probabilistic models of complex systems involving several components, it is 
often convenient to assume that the components behave “independently” of each 
other. This typically simplifies the calculations and the analysis, as illustrated 
in the following example. 


Example 1.22. Network connectivity. A computer network connects two
nodes A and B through intermediate nodes C, D, E, F, as shown in Fig. 1.14(a).
For every pair of directly connected nodes, say i and j, there is a given probability
$p_{ij}$ that the link from i to j is up. We assume that link failures are independent
of each other. What is the probability that there is a path connecting A and B in
which all links are up?


Figure 1.14: (a) Network for Example 1.22. The number next to each link
(i,j) indicates the probability that the link is up. (b) Series and parallel
connections of three components in a reliability problem.

This is a typical problem of assessing the reliability of a system consisting of 
components that can fail independently. Such a system can often be divided into 
subsystems, where each subsystem consists in turn of several components that are 
connected either in series or in parallel; see Fig. 1.14(b). 

Let a subsystem consist of components 1, 2, ..., m, and let $p_i$ be the
probability that component i is up (“succeeds”). Then, a series subsystem succeeds
if all of its components are up, so its probability of success is the product of the
probabilities of success of the corresponding components, i.e.,


$$P(\text{series subsystem succeeds}) = p_1 p_2 \cdots p_m.$$


A parallel subsystem succeeds if any one of its components succeeds, so its
probability of failure is the product of the probabilities of failure of the
corresponding components, i.e.,


$$\begin{aligned}
P(\text{parallel subsystem succeeds}) &= 1 - P(\text{parallel subsystem fails})\\
&= 1 - (1 - p_1)(1 - p_2) \cdots (1 - p_m).
\end{aligned}$$


Returning now to the network of Fig. 1.14(a), we can calculate the probability
of success (a path from A to B is available) sequentially, using the preceding
formulas, and starting from the end. Let us use the notation $X \to Y$ to denote the
event that there is a (possibly indirect) connection from node X to node Y. Then,


$$\begin{aligned}
P(C \to B) &= 1 - \big(1 - P(C \to E \text{ and } E \to B)\big)\big(1 - P(C \to F \text{ and } F \to B)\big)\\
&= 1 - (1 - p_{CE}\,p_{EB})(1 - p_{CF}\,p_{FB})\\
&= 1 - (1 - 0.8 \cdot 0.9)(1 - 0.85 \cdot 0.95)\\
&= 0.946,\\
P(A \to C \text{ and } C \to B) &= P(A \to C)\,P(C \to B) = 0.9 \cdot 0.946 = 0.851,\\
P(A \to D \text{ and } D \to B) &= P(A \to D)\,P(D \to B) = 0.75 \cdot 0.95 = 0.712,
\end{aligned}$$


and finally we obtain the desired probability


$$\begin{aligned}
P(A \to B) &= 1 - \big(1 - P(A \to C \text{ and } C \to B)\big)\big(1 - P(A \to D \text{ and } D \to B)\big)\\
&= 1 - (1 - 0.851)(1 - 0.712)\\
&= 0.957.
\end{aligned}$$
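
The series and parallel formulas lend themselves to a short numerical check. The following Python sketch uses the link probabilities that appear in the calculation above; the helper names `series` and `parallel` are ours, and the code carries full floating-point precision instead of rounding intermediate results.

```python
from math import prod

def series(*p):
    """Success probability when all independent components must be up."""
    return prod(p)

def parallel(*p):
    """Success probability when at least one independent component must be up."""
    return 1 - prod(1 - x for x in p)

# C reaches B through E (0.8, 0.9) or through F (0.85, 0.95).
p_c_to_b = parallel(series(0.8, 0.9), series(0.85, 0.95))

# A reaches B through C (0.9, then C -> B) or through D (0.75, 0.95).
p_a_to_b = parallel(series(0.9, p_c_to_b), series(0.75, 0.95))

print(round(p_c_to_b, 3), round(p_a_to_b, 3))
```

Composing the two helpers mirrors the way the network is decomposed into nested series and parallel subsystems.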


Independent Trials and the Binomial Probabilities 


If an experiment involves a sequence of independent but identical stages, we say 
that we have a sequence of independent trials. In the special case where there 
are only two possible results at each stage, we say that we have a sequence of 
independent Bernoulli trials. The two possible results can be anything, e.g.,
“it rains” or “it doesn’t rain,” but we will often think in terms of coin tosses and
refer to the two results as “heads” (H) and “tails” (T).

Consider an experiment that consists of n independent tosses of a biased
coin, in which the probability of “heads” is p, where p is some number between
0 and 1. In this context, independence means that the events $A_1, A_2, \ldots, A_n$ are
independent, where $A_i$ = {ith toss is a head}.

We can visualize independent Bernoulli trials by means of a sequential
description, as shown in Fig. 1.15 for the case where n = 3. The conditional
probability of any toss being a head, conditioned on the results of any preceding
tosses, is p, because of independence. Thus, by multiplying the conditional
probabilities along the corresponding path of the tree, we see that any particular
outcome (3-long sequence of heads and tails) that involves k heads and $3 - k$
tails has probability $p^k(1-p)^{3-k}$. This formula extends to the case of a general
number n of tosses. We obtain that the probability of any particular n-long
sequence that contains k heads and $n - k$ tails is $p^k(1-p)^{n-k}$, for all k from 0
to n.

HHH: Prob = $p^3$
HHT: Prob = $p^2(1-p)$
HTH: Prob = $p^2(1-p)$
HTT: Prob = $p(1-p)^2$
THH: Prob = $p^2(1-p)$
THT: Prob = $p(1-p)^2$
TTH: Prob = $p(1-p)^2$
TTT: Prob = $(1-p)^3$

Figure 1.15: Sequential description of the sample space of an experiment
involving three independent tosses of a biased coin. Along the branches of the tree,
we record the corresponding conditional probabilities, and by the multiplication
rule, the probability of obtaining a particular 3-toss sequence is calculated by
multiplying the probabilities recorded along the corresponding path of the tree.


Let us now consider the probability


$$p(k) = P(k \text{ heads come up in an } n\text{-toss sequence}),$$

which will play an important role later. We showed above that the probability
of any given sequence that contains k heads is $p^k(1-p)^{n-k}$, so we have


$$p(k) = \binom{n}{k} p^k (1-p)^{n-k},$$

where

$$\binom{n}{k} = \text{number of distinct } n\text{-toss sequences that contain } k \text{ heads}.$$


The numbers $\binom{n}{k}$ (called “n choose k”) are known as the binomial coefficients,
while the probabilities p(k) are known as the binomial probabilities. Using a
counting argument, to be given in Section 1.6, one finds that


$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad k = 0, 1, \ldots, n,$$

where for any positive integer i we have


$$i! = 1 \cdot 2 \cdots (i-1) \cdot i,$$

and, by convention, 0! = 1. An alternative verification is sketched in the
theoretical problems. Note that the binomial probabilities p(k) must add to 1, thus
showing the binomial formula


$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1.$$
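
To make the binomial probabilities concrete, the following Python sketch compares a brute-force count of n-toss sequences with the binomial coefficients and checks that the probabilities add to 1. The values n = 5 and p = 0.3 are arbitrary choices for illustration, and `binomial_pmf` is our own helper name.

```python
from itertools import product
from math import comb, isclose

def binomial_pmf(k, n, p):
    """Probability of k heads in n independent tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.3  # illustrative values, not taken from the text

# Count, by enumeration, the n-toss sequences that contain k heads.
counts = [0] * (n + 1)
for seq in product("HT", repeat=n):
    counts[seq.count("H")] += 1

coefficients = [comb(n, k) for k in range(n + 1)]
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(counts == coefficients)   # enumeration agrees with "n choose k"
print(isclose(total, 1.0))      # the binomial probabilities add to 1
```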


Example 1.23. Grade of service. An internet service provider has installed c 
modems to serve the needs of a population of n customers. It is estimated that at a 
given time, each customer will need a connection with probability p, independently 
of the others. What is the probability that there are more customers needing a 
connection than there are modems? 

Here we are interested in the probability that more than c customers
simultaneously need a connection. It is equal to


$$\sum_{k=c+1}^{n} p(k),$$

where


$$p(k) = \binom{n}{k} p^k (1-p)^{n-k}$$

are the binomial probabilities.

This example is typical of problems of sizing the capacity of a facility to 
serve the needs of a homogeneous population, consisting of independently acting 
customers. The problem is to select the size c to achieve a certain threshold prob- 
ability (sometimes called grade of service) that no user is left unserved. 
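
The sizing problem can be explored numerically. The sketch below is an illustration under our own assumptions: the function names are hypothetical, and the values n = 100, p = 0.1 and the threshold 0.01 are made up for the example.

```python
from math import comb

def prob_overload(n, c, p):
    """P(more than c of the n customers need a connection)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1, n + 1)
    )

def smallest_capacity(n, p, threshold):
    """Smallest number of modems c whose overload probability is <= threshold."""
    for c in range(n + 1):
        if prob_overload(n, c, p) <= threshold:
            return c
    return n  # only reached if threshold is negative

# Hypothetical scenario: 100 customers, each active with probability 0.1.
c = smallest_capacity(100, 0.1, 0.01)
print(c, prob_overload(100, c, 0.1))
```

A linear scan over c is enough here because the overload probability can only decrease as c grows.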


1.6 COUNTING* 


The calculation of probabilities often involves counting of the number of out- 
comes in various events. We have already seen two contexts where such counting 
arises. 
(a) When the sample space $\Omega$ has a finite number of equally likely outcomes,
so that the discrete uniform probability law applies. Then, the probability
of any event A is given by


$$P(A) = \frac{\text{Number of elements of } A}{\text{Number of elements of } \Omega},$$


and involves counting the elements of A and of $\Omega$.


(b) When we want to calculate the probability of an event A with a finite
number of equally likely outcomes, each of which has an already known
probability p. Then the probability of A is given by


$$P(A) = p \cdot (\text{Number of elements of } A),$$

and involves counting of the number of elements of A. An example of this
type is the calculation of the probability of k heads in n coin tosses (the
binomial probabilities). We saw there that the probability of each distinct
sequence involving k heads is easily obtained, but the calculation of the
number of all such sequences is somewhat intricate, as will be seen shortly.


While counting is in principle straightforward, it is frequently challenging; 
the art of counting constitutes a large portion of a field known as combinatorics. 
In this section, we present the basic principle of counting and apply it to a number 
of situations that are often encountered in probabilistic models. 


The Counting Principle 


The counting principle is based on a divide-and-conquer approach, whereby the 
counting is broken down into stages through the use of a tree. For example, 
consider an experiment that consists of two consecutive stages. The possible 
results of the first stage are $a_1, a_2, \ldots, a_m$; the possible results of the second
stage are $b_1, b_2, \ldots, b_n$. Then, the possible results of the two-stage experiment
are all possible ordered pairs $(a_i, b_j)$, $i = 1, \ldots, m$, $j = 1, \ldots, n$. Note that the
number of such ordered pairs is equal to mn. This observation can be generalized
as follows (see also Fig. 1.16).


Figure 1.16: Illustration of the basic counting principle. The counting is carried
out in r stages (r = 4 in the figure). The first stage has $n_1$ possible results. For
every possible result of the first $i - 1$ stages, there are $n_i$ possible results at the
ith stage. The number of leaves is $n_1 n_2 \cdots n_r$. This is the desired count.


The Counting Principle
Consider a process that consists of r stages. Suppose that:

(a) There are $n_1$ possible results for the first stage.


(b) For every possible result of the first stage, there are $n_2$ possible results
at the second stage.


(c) More generally, for all possible results of the first $i - 1$ stages, there
are $n_i$ possible results at the ith stage.


Then, the total number of possible results of the r-stage process is


$$n_1 n_2 \cdots n_r.$$


Example 1.24. The number of telephone numbers. A telephone number 
is a 7-digit sequence, but the first digit has to be different from 0 or 1. How many 
distinct telephone numbers are there? We can visualize the choice of a sequence 
as a sequential process, where we select one digit at a time. We have a total of 7 
stages, and a choice of one out of 10 elements at each stage, except for the first 
stage where we only have 8 choices. Therefore, the answer is 


$$8 \cdot \underbrace{10 \cdot 10 \cdots 10}_{6 \text{ times}} = 8 \cdot 10^6.$$


Example 1.25. The number of subsets of an n-element set. Consider an
n-element set $\{s_1, s_2, \ldots, s_n\}$. How many subsets does it have (including itself and
the empty set)? We can visualize the choice of a subset as a sequential process
where we examine one element at a time and decide whether to include it in the set
or not. We have a total of n stages, and a binary choice at each stage. Therefore
the number of subsets is


$$\underbrace{2 \cdot 2 \cdots 2}_{n \text{ times}} = 2^n.$$
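
Both examples are direct applications of the Counting Principle, which can be illustrated in a few lines of Python. In this sketch, `count_results` is our own helper name, and the three-element set is an arbitrary small case chosen so that the subsets can be listed explicitly.

```python
from itertools import product
from math import prod

def count_results(choices_per_stage):
    """Counting Principle: multiply the number of choices at each stage."""
    return prod(choices_per_stage)

# Telephone numbers: 8 choices for the first digit, 10 for each of the other 6.
phone_numbers = count_results([8] + [10] * 6)

# Subsets of a small set, generated by one include/exclude decision per element.
elements = ["s1", "s2", "s3"]
subsets = [
    {e for e, keep in zip(elements, mask) if keep}
    for mask in product([False, True], repeat=len(elements))
]

print(phone_numbers)
print(len(subsets), count_results([2] * len(elements)))
```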


It should be noted that the Counting Principle remains valid even if each 
first-stage result leads to a different set of potential second-stage results, etc. The 
only requirement is that the number of possible second-stage results is constant, 
regardless of the first-stage result. This observation is used in the sequel. 

In what follows, we will focus primarily on two types of counting arguments 
that involve the selection of k objects out of a collection of n objects. If the order 
of selection matters, the selection is called a permutation, and otherwise, it is
called a combination. We will then discuss a more general type of counting,
involving a partition of a collection of n objects into multiple subsets. 


k-permutations 


We start with n distinct objects, and let k be some positive integer, with $k \le n$.
We wish to count the number of different ways that we can pick k out of these
n objects and arrange them in a sequence, i.e., the number of distinct k-object
sequences. We can choose any of the n objects to be the first one. Having chosen
the first, there are only $n - 1$ possible choices for the second; given the choice of
the first two, there only remain $n - 2$ available objects for the third stage, etc.
When we are ready to select the last (the kth) object, we have already chosen
$k - 1$ objects, which leaves us with $n - (k - 1)$ choices for the last one. By the
Counting Principle, the number of possible sequences, called k-permutations,
is

$$n(n-1)\cdots(n-k+1) = \frac{n(n-1)\cdots(n-k+1)(n-k)\cdots 2 \cdot 1}{(n-k)\cdots 2 \cdot 1} = \frac{n!}{(n-k)!}.$$


In the special case where k = n, the number of possible sequences, simply called
permutations, is

$$n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1 = n!.$$


(Let k = n in the formula for the number of k-permutations, and recall the
convention 0! = 1.)


Example 1.26. Let us count the number of words that consist of four distinct 
letters. This is the problem of counting the number of 4-permutations of the 26 
letters in the alphabet. The desired number is 


$$\frac{n!}{(n-k)!} = \frac{26!}{22!} = 26 \cdot 25 \cdot 24 \cdot 23 = 358{,}800.$$
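
The count of k-permutations can be cross-checked in Python. This sketch uses the standard-library function `math.perm` (available in Python 3.8 and later) alongside the factorial formula, and enumerates a small case with `itertools.permutations`; the four-letter alphabet in the small case is our own choice.

```python
from itertools import permutations
from math import factorial, perm

n, k = 26, 4

# Number of 4-letter words with distinct letters, two ways.
by_formula = factorial(n) // factorial(n - k)
by_library = perm(n, k)

# Small case that can be listed explicitly: 2-permutations of A, B, C, D.
small = ["".join(s) for s in permutations("ABCD", 2)]

print(by_formula, by_library)
print(len(small), small)
```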


The count for permutations can be combined with the Counting Principle 
to solve more complicated counting problems. 


Example 1.27. You have $n_1$ classical music CDs, $n_2$ rock music CDs, and $n_3$
country music CDs. In how many different ways can you arrange them so that the
CDs of the same type are contiguous?

We break down the problem in two stages, where we first select the order of
the CD types, and then the order of the CDs of each type. There are 3! ordered
sequences of the types of CDs (such as classical/rock/country, rock/country/classical,
etc.), and there are $n_1!$ (or $n_2!$, or $n_3!$) permutations of the classical (or rock, or
country, respectively) CDs. Thus for each of the 3! CD type sequences, there are
$n_1!\,n_2!\,n_3!$ arrangements of CDs, and the desired total number is $3!\,n_1!\,n_2!\,n_3!$.


Combinations 


There are n people and we are interested in forming a committee of k. How 
many different committees are there? More abstractly, this is the same as the 
problem of counting the number of k-element subsets of a given n-element set. 
Notice that forming a combination is different from forming a k-permutation,
because in a combination there is no ordering of the selected elements.
Thus for example, whereas the 2-permutations of the letters A, B, C, and D are


AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC,

the combinations of two out of these four letters are

AB, AC, AD, BC, BD, CD.


There is a close connection between the number of combinations and the
binomial coefficient that was introduced in Section 1.5. To see this, note that
specifying an n-toss sequence with k heads is the same as picking k elements
(those that correspond to heads) out of the n-element set of tosses. Thus, the
number of combinations is the same as the binomial coefficient $\binom{n}{k}$ introduced
in Section 1.5.

To count the number of combinations, note that selecting a k-permutation
is the same as first selecting a combination of k items and then ordering them.
Since there are k! ways of ordering the k selected items, we see that the number
of k-permutations is equal to the number of combinations times k!. Hence, the
number of possible combinations is given by

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$


Example 1.28. The number of combinations of two out of the four letters A, B,
C, and D is found by letting n = 4 and k = 2. It is


$$\binom{4}{2} = \frac{4!}{2!\,2!} = 6,$$


consistent with the listing given earlier.
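
The relation between combinations and k-permutations can be checked on the four-letter example. The following Python sketch uses the standard-library functions `math.comb` and `itertools.combinations`; the variable names are ours.

```python
from itertools import combinations, permutations
from math import comb, factorial

letters = "ABCD"
k = 2

# Unordered selections of two letters.
pairs = ["".join(c) for c in combinations(letters, k)]

# Each combination can be ordered in k! ways, giving the k-permutations.
num_permutations = len(list(permutations(letters, k)))

print(pairs)
print(len(pairs), comb(len(letters), k), num_permutations // factorial(k))
```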


It is worth observing that counting arguments sometimes lead to formulas
that are rather difficult to derive algebraically. One example is the binomial
formula

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1,$$

discussed in Section 1.5. Here is another example. Since $\binom{n}{k}$ is the number of
k-element subsets of a given n-element set, the sum over k of $\binom{n}{k}$ counts the
number of subsets of all possible cardinalities. It is therefore equal to the number
of all subsets of an n-element set, which is $2^n$, and we obtain


$$\sum_{k=0}^{n} \binom{n}{k} = 2^n.$$


Partitions 


Recall that a combination is a choice of k elements out of an n-element set 
without regard to order. This is the same as partitioning the set in two: one 
part contains k elements and the other contains the remaining n — k. We now 
generalize by considering partitions into more than two subsets.

We have n distinct objects and we are given nonnegative integers $n_1, n_2, \ldots, n_r$,
whose sum is equal to n. The n items are to be divided into r disjoint groups,
with the ith group containing exactly $n_i$ items. Let us count in how many ways
this can be done.

We form the groups one at a time. We have $\binom{n}{n_1}$ ways of forming the first
group. Having formed the first group, we are left with $n - n_1$ objects. We need
to choose $n_2$ of them in order to form the second group, and we have $\binom{n-n_1}{n_2}$
choices, etc. Using the Counting Principle for this r-stage process, the total
number of choices is


$$\binom{n}{n_1}\binom{n-n_1}{n_2}\binom{n-n_1-n_2}{n_3}\cdots\binom{n-n_1-\cdots-n_{r-1}}{n_r},$$


which is equal to


$$\frac{n!}{n_1!\,(n-n_1)!} \cdot \frac{(n-n_1)!}{n_2!\,(n-n_1-n_2)!} \cdots \frac{(n-n_1-\cdots-n_{r-1})!}{(n-n_1-\cdots-n_{r-1}-n_r)!\,n_r!}.$$

We note that several terms cancel and we are left with


$$\frac{n!}{n_1!\,n_2!\cdots n_r!}.$$


This is called the multinomial coefficient and is usually denoted by


$$\binom{n}{n_1, n_2, \ldots, n_r}.$$

Example 1.29. Anagrams. How many different letter sequences can be obtained
by rearranging the letters in the word TATTOO? There are six positions to be filled
by the available letters. Each rearrangement corresponds to a partition of the set of
the six positions into a group of size 3 (the positions that get the letter T), a group
of size 1 (the position that gets the letter A), and a group of size 2 (the positions
that get the letter O). Thus, the desired number is


$$\frac{6!}{1!\,2!\,3!} = \frac{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 \cdot 6}{1 \cdot 1 \cdot 2 \cdot 1 \cdot 2 \cdot 3} = 60.$$

It is instructive to rederive this answer using an alternative argument. (This
argument can also be used to rederive the multinomial coefficient formula; see
the theoretical problems.) Let us rewrite TATTOO in the form $T_1 A T_2 T_3 O_1 O_2$,
pretending for a moment that we are dealing with 6 distinguishable objects. These
6 objects can be rearranged in 6! different ways. However, any of the 3! possible
permutations of $T_1$, $T_2$, and $T_3$, as well as any of the 2! possible permutations of
$O_1$ and $O_2$, lead to the same word. Thus, when the subscripts are removed, there
are only 6!/(3!2!) different words.
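
Both arguments can be confirmed by brute force, since the word is short. In the Python sketch below, `multinomial` is our own helper (it is not a standard-library function), and the enumeration simply collects the distinct rearrangements in a set.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def multinomial(*group_sizes):
    """n! / (n_1! n_2! ... n_r!), where n is the sum of the group sizes."""
    result = factorial(sum(group_sizes))
    for size in group_sizes:
        result //= factorial(size)
    return result

word = "TATTOO"

# Group sizes are the letter multiplicities: T appears 3 times, A once, O twice.
by_formula = multinomial(*Counter(word).values())

# Enumerate all 6! orderings and keep only the distinct letter sequences.
by_enumeration = len(set(permutations(word)))

print(by_formula, by_enumeration)
```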


Example 1.30. A class consisting of 4 graduate and 12 undergraduate students 
is randomly divided into four groups of 4. What is the probability that each group 
includes a graduate student? This is the same as Example 1.11 in Section 1.3, but 
we will now obtain the answer using a counting argument. 

We first determine the nature of the sample space. A typical outcome is a 
particular way of partitioning the 16 students into four groups of 4. We take the 
term “randomly” to mean that every possible partition is equally likely, so that the 
probability question can be reduced to one of counting. 

According to our earlier discussion, there are


$$\binom{16}{4,4,4,4} = \frac{16!}{4!\,4!\,4!\,4!}$$


different partitions, and this is the size of the sample space.

Let us now focus on the event that each group contains a graduate student.
Generating an outcome with this property can be accomplished in two stages:


(a) Take the four graduate students and distribute them to the four groups; there
are four choices for the group of the first graduate student, three choices for
the second, two for the third, and one for the fourth. Thus, there is a total of
4! choices for this stage.


(b) Take the remaining 12 undergraduate students and distribute them to the
four groups (3 students in each). This can be done in


$$\binom{12}{3,3,3,3} = \frac{12!}{3!\,3!\,3!\,3!}$$


different ways.

By the Counting Principle, the event of interest can materialize in


$$\frac{4!\,12!}{3!\,3!\,3!\,3!}$$

different ways. The probability of this event is

$$\frac{\dfrac{4!\,12!}{3!\,3!\,3!\,3!}}{\dfrac{16!}{4!\,4!\,4!\,4!}}.$$


After some cancellations, we can see that this is the same as the answer
$12 \cdot 8 \cdot 4/(15 \cdot 14 \cdot 13)$ obtained in Example 1.11.
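
The cancellation can be verified with exact rational arithmetic. This Python sketch recomputes the ratio of the two counts and compares it with the product form quoted from Example 1.11; `multinomial` is our own helper name.

```python
from fractions import Fraction
from math import factorial

def multinomial(*group_sizes):
    """n! / (n_1! n_2! ... n_r!), where n is the sum of the group sizes."""
    result = factorial(sum(group_sizes))
    for size in group_sizes:
        result //= factorial(size)
    return result

# Size of the sample space: partitions of 16 students into four groups of 4.
total = multinomial(4, 4, 4, 4)

# Favorable outcomes: place the graduate students (4! ways), then split the
# 12 undergraduates into four groups of 3.
favorable = factorial(4) * multinomial(3, 3, 3, 3)

probability = Fraction(favorable, total)
print(probability, float(probability))
print(probability == Fraction(12 * 8 * 4, 15 * 14 * 13))
```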


Here is a summary of all the counting results we have developed. 


Summary of Counting Results 


Permutations of n objects: $n!$


k-permutations of n objects: $n!/(n-k)!$


Combinations of k out of n objects: $\binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$


Partitions of n objects into r groups with the ith group having $n_i$
objects: $\binom{n}{n_1, n_2, \ldots, n_r} = \dfrac{n!}{n_1!\,n_2!\cdots n_r!}$


1.7 SUMMARY AND DISCUSSION 


A probability problem can usually be broken down into a few basic steps: 


1. The description of the sample space, that is, the set of possible outcomes 
of a given experiment. 


2. The (possibly indirect) specification of the probability law (the probability 
of each event). 


3. The calculation of probabilities and conditional probabilities of various 
events of interest. 


The probabilities of events must satisfy the nonnegativity, additivity, and nor- 
malization axioms. In the important special case where the set of possible out- 
comes is finite, one can just specify the probability of each outcome and obtain 
the probability of any event by adding the probabilities of the elements of the 
event. 

Conditional probabilities can be viewed as probability laws on the same
sample space. We can also view the conditioning event as a new universe,
because only outcomes contained in the conditioning event can have positive
conditional probability. Conditional probabilities are derived from the (unconditional)
probability law using the definition $P(A \mid B) = P(A \cap B)/P(B)$. However, the
reverse process is often convenient, that is, first specify some conditional
probabilities that are natural for the real situation that we wish to model, and then
use them to derive the (unconditional) probability law. Two important tools in
this context are the multiplication rule and the total probability theorem.

We have illustrated through examples three methods of specifying proba- 
bility laws in probabilistic models: 


(1) The counting method. This method applies to the case where the num- 
ber of possible outcomes is finite, and all outcomes are equally likely. To 
calculate the probability of an event, we count the number of elements in 
the event and divide by the number of elements of the sample space. 


(2) The sequential method. This method applies when the experiment has a 
sequential character, and suitable conditional probabilities are specified or 
calculated along the branches of the corresponding tree (perhaps using the 
counting method). The probabilities of various events are then obtained 
by multiplying conditional probabilities along the corresponding paths of 
the tree, using the multiplication rule. 


(3) The divide-and-conquer method. Here, the probabilities P(B) of various
events B are obtained from conditional probabilities $P(B \mid A_i)$, where
the $A_i$ are suitable events that form a partition of the sample space and
have known probabilities $P(A_i)$. The probabilities P(B) are then obtained
by using the total probability theorem.


Finally, we have focused on a few side topics that reinforce our main themes. 
We have discussed the use of Bayes’ rule in inference, which is an important 
application context. We have also discussed some basic principles of counting 
and combinatorics, which are helpful in applying the counting method. 


Discrete Random Variables


Contents

2.1. Basic Concepts
2.2. Probability Mass Functions
2.3. Functions of Random Variables
2.4. Expectation, Mean, and Variance
2.5. Joint PMFs of Multiple Random Variables
2.6. Conditioning
2.7. Independence
2.8. Summary and Discussion


2 Discrete Random Variables Chap. 2 


2.1 BASIC CONCEPTS 


In many probabilistic models, the outcomes are of a numerical nature, e.g., if 
they correspond to instrument readings or stock prices. In other experiments, 
the outcomes are not numerical, but they may be associated with some numerical 
values of interest. For example, if the experiment is the selection of students from 
a given population, we may wish to consider their grade point average. When 
dealing with such numerical values, it is often useful to assign probabilities to 
them. This is done through the notion of a random variable, the focus of the 
present chapter. 

Given an experiment and the corresponding set of possible outcomes (the 
sample space), a random variable associates a particular number with each out- 
come; see Fig. 2.1. We refer to this number as the numerical value or the 
experimental value of the random variable. Mathematically, a random vari- 
able is a real-valued function of the experimental outcome. 


Figure 2.1: (a) Visualization of a random variable. It is a function that assigns 
a numerical value to each possible outcome of the experiment. (b) An example 
of a random variable. The experiment consists of two rolls of a 4-sided die, and 
the random variable is the maximum of the two rolls. If the outcome of the 
experiment is (4,2), the experimental value of this random variable is 4. 


Here are some examples of random variables: 


(a) In an experiment involving a sequence of 5 tosses of a coin, the number of
heads in the sequence is a random variable. However, the 5-long sequence
of heads and tails is not considered a random variable because it does not
have an explicit numerical value.


(b) In an experiment involving two rolls of a die, the following are examples of 
random variables: 


(1) The sum of the two rolls. 
(2) The number of sixes in the two rolls. 
(3) The second roll raised to the fifth power. 


(c) In an experiment involving the transmission of a message, the time needed 
to transmit the message, the number of symbols received in error, and the 
delay with which the message is received are all random variables. 


There are several basic concepts associated with random variables, which 
are summarized below. 


Main Concepts Related to Random Variables 
Starting with a probabilistic model of an experiment: 


• A random variable is a real-valued function of the outcome of the
experiment.


• A function of a random variable defines another random variable.


• We can associate with each random variable certain “averages” of
interest, such as the mean and the variance.


• A random variable can be conditioned on an event or on another
random variable.


• There is a notion of independence of a random variable from an
event or from another random variable.


A random variable is called discrete if its range (the set of values that 
it can take) is finite or at most countably infinite. For example, the random 
variables mentioned in (a) and (b) above can take at most a finite number of 
numerical values, and are therefore discrete. 

A random variable that can take an uncountably infinite number of values
is not discrete. For an example, consider the experiment of choosing a point
a from the interval [−1, 1]. The random variable that associates the numerical
value $a^2$ to the outcome a is not discrete. On the other hand, the random variable
that associates with a the numerical value


$$\operatorname{sgn}(a) = \begin{cases} 1 & \text{if } a > 0,\\ 0 & \text{if } a = 0,\\ -1 & \text{if } a < 0, \end{cases}$$


is discrete.


In this chapter, we focus exclusively on discrete random variables, even
though we will typically omit the qualifier “discrete.”


Concepts Related to Discrete Random Variables
Starting with a probabilistic model of an experiment:


• A discrete random variable is a real-valued function of the outcome
of the experiment that can take a finite or countably infinite number
of values.


• A (discrete) random variable has an associated probability mass
function (PMF), which gives the probability of each numerical value
that the random variable can take.


• A function of a random variable defines another random variable,
whose PMF can be obtained from the PMF of the original random
variable.


We will discuss each of the above concepts and the associated methodology
in the following sections. In addition, we will provide examples of some important
and frequently encountered random variables. In Chapter 3, we will discuss
general (not necessarily discrete) random variables.

Even though this chapter may appear to be covering a lot of new ground,
this is not really the case. The general line of development is to simply take
the concepts from Chapter 1 (probabilities, conditioning, independence, etc.)
and apply them to random variables rather than events, together with some
appropriate new notation. The only genuinely new concepts relate to means and
variances.


2.2 PROBABILITY MASS FUNCTIONS


The most important way to characterize a random variable is through the
probabilities of the values that it can take. For a discrete random variable X, these
are captured by the probability mass function (PMF for short) of X, denoted
$p_X$. In particular, if x is any possible value of X, the probability mass of x,
denoted $p_X(x)$, is the probability of the event {X = x} consisting of all outcomes
that give rise to a value of X equal to x:


$$p_X(x) = P(\{X = x\}).$$

For example, let the experiment consist of two independent tosses of a fair coin,
and let X be the number of heads obtained. Then the PMF of X is


$$p_X(x) = \begin{cases} 1/4 & \text{if } x = 0 \text{ or } x = 2,\\ 1/2 & \text{if } x = 1,\\ 0 & \text{otherwise.} \end{cases}$$

In what follows, we will often omit the braces from the event/set notation,
when no ambiguity can arise. In particular, we will usually write P(X = x)
in place of the more correct notation P({X = x}). We will also adhere to
the following convention throughout: we will use upper case characters
to denote random variables, and lower case characters to denote real
numbers such as the numerical values of a random variable.


Note that 

$$\sum_x p_X(x) = 1,$$

where in the summation above, $x$ ranges over all the possible numerical values 
of $X$. This follows from the additivity and normalization axioms, because the 
events $\{X = x\}$ are disjoint and form a partition of the sample space, as $x$ 
ranges over all possible values of $X$. By a similar argument, for any set $S$ of real 
numbers, we also have 

$$P(X \in S) = \sum_{x \in S} p_X(x).$$

For example, if $X$ is the number of heads obtained in two independent tosses of 
a fair coin, as above, the probability of at least one head is 

$$P(X > 0) = \sum_{x > 0} p_X(x) = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}.$$


Calculating the PMF of X is conceptually straightforward, and is illus- 
trated in Fig. 2.2. 


Calculation of the PMF of a Random Variable X 

For each possible value x of X: 
1. Collect all the possible outcomes that give rise to the event $\{X = x\}$. 
2. Add their probabilities to obtain $p_X(x)$. 
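
The two steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the helper name `pmf_from_outcomes` is our own, and we assume the sample space of Fig. 2.2(b), that is, two independent rolls of a fair 4-sided die with $X$ equal to the maximum roll.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def pmf_from_outcomes(outcomes, prob, rv):
    """Return {x: p_X(x)}: for each outcome, add its probability to the
    entry of the value x = rv(outcome) that it gives rise to."""
    pmf = defaultdict(Fraction)  # missing entries start at 0
    for outcome in outcomes:
        pmf[rv(outcome)] += prob(outcome)
    return dict(pmf)

# Assumed sample space: ordered pairs of rolls, each with probability 1/16.
rolls = list(product(range(1, 5), repeat=2))
pmf_max = pmf_from_outcomes(rolls, lambda outcome: Fraction(1, 16), max)

for x in sorted(pmf_max):
    print(x, pmf_max[x])
```

Exact fractions are used so that the result can be compared directly with the value $p_X(2) = 3/16$ quoted in the caption of Fig. 2.2.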


The Bernoulli Random Variable 


Consider the toss of a biased coin, which comes up a head with probability $p$, 
and a tail with probability $1 - p$. The Bernoulli random variable takes the two 
values 1 and 0, depending on whether the outcome is a head or a tail: 

$$X = \begin{cases} 1 & \text{if a head,} \\ 0 & \text{if a tail.} \end{cases}$$

Its PMF is 

$$p_X(x) = \begin{cases} p & \text{if } x = 1, \\ 1 - p & \text{if } x = 0. \end{cases}$$


Figure 2.2: (a) Illustration of the method to calculate the PMF of a random 
variable $X$. For each possible value $x$, we collect all the outcomes that give rise 
to $X = x$ and add their probabilities to obtain $p_X(x)$. (b) Calculation of the 
PMF $p_X$ of the random variable $X$ = maximum roll in two independent rolls 
of a fair 4-sided die. There are four possible values $x$, namely, 1, 2, 3, 4. To 
calculate $p_X(x)$ for a given $x$, we add the probabilities of the outcomes that give 
rise to $x$. For example, there are three outcomes that give rise to $x = 2$, namely, 
(1, 2), (2, 2), (2, 1). Each of these outcomes has probability 1/16, so $p_X(2) = 3/16$, 
as indicated in the figure. 


For all its simplicity, the Bernoulli random variable is very important. In 
practice, it is used to model generic probabilistic situations with just two out- 
comes, such as: 


(a) The state of a telephone at a given time that can be either free or busy. 
(b) A person who can be either healthy or sick with a certain disease. 


(c) The preference of a person who can be either for or against a certain po- 
litical candidate. 


Furthermore, by combining multiple Bernoulli random variables, one can 
construct more complicated random variables. 


The Binomial Random Variable 


A biased coin is tossed $n$ times. At each toss, the coin comes up a head with 
probability $p$, and a tail with probability $1 - p$, independently of prior tosses. Let 
$X$ be the number of heads in the $n$-toss sequence. We refer to $X$ as a binomial 
random variable with parameters $n$ and $p$. The PMF of $X$ consists of the 
binomial probabilities that were calculated in Section 1.4: 

$$p_X(k) = P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

(Note that here and elsewhere, we simplify notation and use $k$, instead of $x$, to 
denote the experimental values of integer-valued random variables.) The 
normalization property $\sum_x p_X(x) = 1$, specialized to the binomial random variable, 
is written as 

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = 1.$$

Some special cases of the binomial PMF are sketched in Fig. 2.3. 


Figure 2.3: The PMF of a binomial random variable, sketched for $n = 9$, $p = 1/2$, 
and for $n$ large, $p$ small. If $p = 1/2$, the PMF is 
symmetric around $n/2$. Otherwise, the PMF is skewed towards 0 if $p < 1/2$, and 
towards $n$ if $p > 1/2$. 
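
As a quick numerical companion to the binomial PMF, the following Python sketch evaluates the formula with exact fractions and checks the normalization property and the symmetry noted in Fig. 2.3. The function name `binomial_pmf` is our own choice, and the parameters $n = 9$, $p = 1/2$ are taken from the figure.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Return [p_X(0), ..., p_X(n)] for a binomial with parameters n and p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

pmf = binomial_pmf(9, Fraction(1, 2))
print(sum(pmf))           # normalization: the probabilities add up to 1
print(pmf == pmf[::-1])   # symmetry around n/2 when p = 1/2
```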


The Geometric Random Variable 


Suppose that we repeatedly and independently toss a biased coin with probability 
of a head $p$, where $0 < p < 1$. The geometric random variable is the number 
$X$ of tosses needed for a head to come up for the first time. Its PMF is given by 

$$p_X(k) = (1 - p)^{k-1} p, \qquad k = 1, 2, \ldots,$$

since $(1 - p)^{k-1} p$ is the probability of the sequence consisting of $k - 1$ successive 
tails followed by a head; see Fig. 2.4. This is a legitimate PMF because 

$$\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1 - p)^{k-1} p = p \sum_{k=0}^{\infty} (1 - p)^k = p \cdot \frac{1}{1 - (1 - p)} = 1.$$

Naturally, the use of coin tosses here is just to provide insight. More 
generally, we can interpret the geometric random variable in terms of repeated 
independent trials until the first “success.” Each trial has probability of success 
$p$ and the number of trials until (and including) the first success is modeled by 
the geometric random variable. 


Figure 2.4: The PMF 

$$p_X(k) = (1 - p)^{k-1} p, \qquad k = 1, 2, \ldots,$$

of a geometric random variable. It decreases as a geometric progression with 
parameter $1 - p$. 


The Poisson Random Variable 


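A Poisson random variable takes nonnegative integer values. Its PMF is given 
by 

$$p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots,$$

where $\lambda$ is a positive parameter characterizing the PMF; see Fig. 2.5. It is a 
legitimate PMF because 

$$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \left( 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots \right) = e^{-\lambda} e^{\lambda} = 1.$$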


To get a feel for the Poisson random variable, think of a binomial random 
variable with very small p and very large n. For example, consider the number 
of typos in a book with a total of n words, when the probability p that any one 
word is misspelled is very small (associate a word with a coin toss which comes 
a head when the word is misspelled), or the number of cars involved in accidents 
in a city on a given day (associate a car with a coin toss which comes a head 
when the car has an accident). Such a random variable can be well-modeled as 
a Poisson random variable. 


Figure 2.5: The PMF $e^{-\lambda} \lambda^k / k!$ of the Poisson random variable for different values 
of $\lambda$ (the two panels show $\lambda = 0.5$ and $\lambda = 3$). Note that if $\lambda < 1$, then the PMF is monotonically decreasing, while if 
$\lambda > 1$, the PMF first increases and then decreases as the value of $k$ increases (this 
is shown in the end-of-chapter problems). 


More precisely, the Poisson PMF with parameter $\lambda$ is a good approximation 
for a binomial PMF with parameters $n$ and $p$, provided $\lambda = np$, $n$ is very large, 
and $p$ is very small, i.e., 

$$e^{-\lambda} \frac{\lambda^k}{k!} \approx \frac{n!}{(n - k)!\, k!}\, p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

In this case, using the Poisson PMF may result in simpler models and 
calculations. For example, let $n = 100$ and $p = 0.01$. Then the probability of $k = 5$ 
successes in $n = 100$ trials is calculated using the binomial PMF as 

$$\frac{100!}{95!\, 5!} \cdot 0.01^5 (1 - 0.01)^{95} = 0.00290.$$

Using the Poisson PMF with $\lambda = np = 100 \cdot 0.01 = 1$, this probability is 
approximated by 

$$e^{-1} \frac{1}{5!} = 0.00306.$$

We provide a formal justification of the Poisson approximation property 
in the end-of-chapter problems and also in Chapter 5, where we will further 
interpret it, extend it, and use it in the context of the Poisson process. 
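
The numerical comparison above is easy to reproduce. The following Python sketch is ours, not part of the original notes; the function names are illustrative, and the values $n = 100$, $p = 0.01$, $k = 5$ are those of the example.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of the value k for parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p, k = 100, 0.01, 5
exact = binomial_pmf(k, n, p)
approx = poisson_pmf(k, n * p)   # lambda = n * p
print(f"binomial: {exact:.6f}   Poisson: {approx:.6f}")
```

Trying larger $n$ with $np$ held fixed is a simple way to see the two values move closer together.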


2.3 FUNCTIONS OF RANDOM VARIABLES 


Consider a probability model of today’s weather, let the random variable X 
be the temperature in degrees Celsius, and consider the transformation Y = 
1.8X + 32, which gives the temperature in degrees Fahrenheit. In this example, 
Y is a linear function of X, of the form 


$$Y = g(X) = aX + b,$$

where $a$ and $b$ are scalars. We may also consider nonlinear functions of the 
general form 

$$Y = g(X).$$


For example, if we wish to display temperatures on a logarithmic scale, we would 
want to use the function g(X) = log X. 

If Y = g(X) is a function of a random variable X, then Y is also a random 
variable, since it provides a numerical value for each possible outcome. This is 
because every outcome in the sample space defines a numerical value x for X 
and hence also the numerical value y = g(x) for Y. If X is discrete with PMF 
$p_X$, then $Y$ is also discrete, and its PMF $p_Y$ can be calculated using the PMF 
of $X$. In particular, to obtain $p_Y(y)$ for any $y$, we add the probabilities of all 
values of $x$ such that $g(x) = y$: 

$$p_Y(y) = \sum_{\{x \mid g(x) = y\}} p_X(x).$$


Example 2.1. Let $Y = |X|$ and let us apply the preceding formula for the PMF 
$p_Y$ to the case where 

$$p_X(x) = \begin{cases} 1/9 & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0 & \text{otherwise.} \end{cases}$$

The possible values of $Y$ are $y = 0, 1, 2, 3, 4$. To compute $p_Y(y)$ for some given 
value $y$ from this range, we must add $p_X(x)$ over all values $x$ such that $|x| = y$. In 
particular, there is only one value of $X$ that corresponds to $y = 0$, namely $x = 0$. 
Thus, 

$$p_Y(0) = p_X(0) = \frac{1}{9}.$$

Also, there are two values of $X$ that correspond to each $y = 1, 2, 3, 4$, so for example, 

$$p_Y(1) = p_X(-1) + p_X(1) = \frac{2}{9}.$$

Thus, the PMF of $Y$ is 

$$p_Y(y) = \begin{cases} 2/9 & \text{if } y = 1, 2, 3, 4, \\ 1/9 & \text{if } y = 0, \\ 0 & \text{otherwise.} \end{cases}$$

For another related example, let $Z = X^2$. To obtain the PMF of $Z$, we 
can view it either as the square of the random variable $X$ or as the square of the 
random variable $Y$. By applying the formula $p_Z(z) = \sum_{\{x \mid x^2 = z\}} p_X(x)$ or the 
formula $p_Z(z) = \sum_{\{y \mid y^2 = z\}} p_Y(y)$, we obtain 

$$p_Z(z) = \begin{cases} 2/9 & \text{if } z = 1, 4, 9, 16, \\ 1/9 & \text{if } z = 0, \\ 0 & \text{otherwise.} \end{cases}$$


Figure 2.7: The PMFs of $X$ and $Y = |X|$ in Example 2.1. 
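
The formula for the PMF of a function of a random variable translates directly into code. The Python sketch below is an illustration under our own naming (`pmf_of_function` is hypothetical); it represents a PMF as a dictionary from values to probabilities and reproduces the PMFs of $Y = |X|$ and $Z = X^2$ from Example 2.1.

```python
from collections import defaultdict
from fractions import Fraction

def pmf_of_function(pmf_x, g):
    """Return the PMF of Y = g(X): p_Y(y) is the sum of p_X(x)
    over all values x such that g(x) == y."""
    pmf_y = defaultdict(Fraction)  # missing entries start at 0
    for x, px in pmf_x.items():
        pmf_y[g(x)] += px
    return dict(pmf_y)

# X takes each integer value in [-4, 4] with probability 1/9.
pmf_x = {x: Fraction(1, 9) for x in range(-4, 5)}

pmf_y = pmf_of_function(pmf_x, abs)                    # Y = |X|
pmf_z_from_x = pmf_of_function(pmf_x, lambda x: x * x)  # Z = X^2
pmf_z_from_y = pmf_of_function(pmf_y, lambda y: y * y)  # Z = Y^2

print(pmf_y)
print(pmf_z_from_x == pmf_z_from_y)
```

Both routes to the PMF of $Z$ give the same dictionary, which mirrors the remark that $Z$ can be viewed as the square of either $X$ or $Y$.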


2.4 EXPECTATION, MEAN, AND VARIANCE 


The PMF of a random variable X provides us with several numbers, the proba- 
bilities of all the possible values of X. It would be desirable to summarize this 
information in a single representative number. This is accomplished by the ex- 
pectation of X, which is a weighted (in proportion to probabilities) average of 
the possible values of X. 

As motivation, suppose you spin a wheel of fortune many times. At each 
spin, one of the numbers $m_1, m_2, \ldots, m_n$ comes up with corresponding 
probability $p_1, p_2, \ldots, p_n$, and this is your monetary reward from that spin. What is 
the amount of money that you “expect” to get “per spin”? The terms “expect” 
and “per spin” are a little ambiguous, but here is a reasonable interpretation. 

Suppose that you spin the wheel $k$ times, and that $k_i$ is the number of times 
that the outcome is $m_i$. Then, the total amount received is 
$m_1 k_1 + m_2 k_2 + \cdots + m_n k_n$. The amount received per spin is 

$$M = \frac{m_1 k_1 + m_2 k_2 + \cdots + m_n k_n}{k}.$$

If the number of spins $k$ is very large, and if we are willing to interpret 
probabilities as relative frequencies, it is reasonable to anticipate that $m_i$ comes up a 
fraction of times that is roughly equal to $p_i$: 

$$p_i \approx \frac{k_i}{k}, \qquad i = 1, \ldots, n.$$

Thus, the amount of money per spin that you “expect” to receive is 

$$M = \frac{m_1 k_1 + m_2 k_2 + \cdots + m_n k_n}{k} \approx m_1 p_1 + m_2 p_2 + \cdots + m_n p_n.$$



Motivated by this example, we introduce an important definition. 


Expectation 


We define the expected value (also called the expectation or the mean) 
of a random variable $X$, with PMF $p_X(x)$, by† 

$$E[X] = \sum_x x\, p_X(x).$$


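Example 2.2. Consider two independent coin tosses, each with a 3/4 probability 
of a head, and let $X$ be the number of heads obtained. This is a binomial random 
variable with parameters $n = 2$ and $p = 3/4$. Its PMF is 

$$p_X(k) = \begin{cases} (1/4)^2 & \text{if } k = 0, \\ 2 \cdot (1/4) \cdot (3/4) & \text{if } k = 1, \\ (3/4)^2 & \text{if } k = 2, \end{cases}$$

so the mean is 

$$E[X] = 0 \cdot \left(\frac{1}{4}\right)^2 + 1 \cdot \left(2 \cdot \frac{1}{4} \cdot \frac{3}{4}\right) + 2 \cdot \left(\frac{3}{4}\right)^2 = \frac{24}{16} = \frac{3}{2}.$$
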
It is useful to view the mean of X as a “representative” value of X, which 
lies somewhere in the middle of its range. We can make this statement more 
precise, by viewing the mean as the center of gravity of the PMF, in the sense 
explained in Fig. 2.8. 


† When dealing with random variables that take a countably infinite 
number of values, one has to deal with the possibility that the infinite sum $\sum_x x\, p_X(x)$ 
is not well-defined. More concretely, we will say that the expectation is 
well-defined if $\sum_x |x|\, p_X(x) < \infty$. In that case, it is known that the infinite sum 
$\sum_x x\, p_X(x)$ converges to a finite value that is independent of the order in which 
the various terms are summed. 

For an example where the expectation is not well-defined, consider a 
random variable $X$ that takes the value $2^k$ with probability $2^{-k}$, for $k = 1, 2, \ldots$. 
For a more subtle example, consider the random variable $X$ that takes the 
values $2^k$ and $-2^k$ with probability $2^{-k}$, for $k = 2, 3, \ldots$. The expectation is again 
undefined, even though the PMF is symmetric around zero and one might be 
tempted to say that $E[X]$ is zero. 

Throughout this book, in the absence of an indication to the contrary, we implicitly 
assume that the expected value of the random variables of interest is well-defined. 


Figure 2.8: Interpretation of the mean as a center of gravity. Given a bar with 
a weight $p_X(x)$ placed at each point $x$ with $p_X(x) > 0$, the center of gravity $c$ is 
the point at which the sum of the torques from the weights to its left is equal 
to the sum of the torques from the weights to its right, that is, 

$$\sum_x (x - c)\, p_X(x) = 0, \qquad \text{or} \qquad c = \sum_x x\, p_X(x),$$

and the center of gravity is equal to the mean $E[X]$. 


There are many other quantities that can be associated with a random 
variable and its PMF. For example, we define the 2nd moment of the random 
variable $X$ as the expected value of the random variable $X^2$. More generally, we 
define the $n$th moment as $E[X^n]$, the expected value of the random variable 
$X^n$. With this terminology, the 1st moment of $X$ is just the mean. 

The most important quantity associated with a random variable $X$, other 
than the mean, is its variance, which is denoted by $\operatorname{var}(X)$ and is defined as 
the expected value of the random variable $(X - E[X])^2$, i.e., 

$$\operatorname{var}(X) = E\big[(X - E[X])^2\big].$$

Since $(X - E[X])^2$ can only take nonnegative values, the variance is always 
nonnegative. 

The variance provides a measure of dispersion of X around its mean. An- 
other measure of dispersion is the standard deviation of X, which is defined 
as the square root of the variance and is denoted by $\sigma_X$: 

$$\sigma_X = \sqrt{\operatorname{var}(X)}.$$


The standard deviation is often easier to interpret, because it has the same units 
as X. For example, if X measures length in meters, the units of variance are 
square meters, while the units of the standard deviation are meters. 

One way to calculate $\operatorname{var}(X)$ is to use the definition of expected value, 
after calculating the PMF of the random variable $(X - E[X])^2$. This latter 
random variable is a function of $X$, and its PMF can be obtained in the manner 
discussed in the preceding section. 


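Example 2.3. Consider the random variable $X$ of Example 2.1, which has the 
PMF 

$$p_X(x) = \begin{cases} 1/9 & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0 & \text{otherwise.} \end{cases}$$

The mean $E[X]$ is equal to 0. This can be seen from the symmetry of the PMF of 
$X$ around 0, and can also be verified from the definition: 

$$E[X] = \sum_x x\, p_X(x) = \frac{1}{9} \sum_{x=-4}^{4} x = 0.$$

Let $Z = (X - E[X])^2 = X^2$. As in Example 2.1, we obtain 

$$p_Z(z) = \begin{cases} 2/9 & \text{if } z = 1, 4, 9, 16, \\ 1/9 & \text{if } z = 0, \\ 0 & \text{otherwise.} \end{cases}$$

The variance of $X$ is then obtained by 

$$\operatorname{var}(X) = E[Z] = \sum_z z\, p_Z(z) = \frac{2}{9} \cdot 1 + \frac{2}{9} \cdot 4 + \frac{2}{9} \cdot 9 + \frac{2}{9} \cdot 16 = \frac{60}{9}.$$
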

It turns out that there is an easier method to calculate $\operatorname{var}(X)$, which uses 
the PMF of $X$ but does not require the PMF of $(X - E[X])^2$. This method is 
based on the following rule. 


Expected Value Rule for Functions of Random Variables 


Let $X$ be a random variable with PMF $p_X(x)$, and let $g(X)$ be a 
real-valued function of $X$. Then, the expected value of the random variable 
$g(X)$ is given by 

$$E[g(X)] = \sum_x g(x)\, p_X(x).$$


To verify this rule, we use the formula $p_Y(y) = \sum_{\{x \mid g(x) = y\}} p_X(x)$ derived 
in the preceding section. With $Y = g(X)$, we have 

$$\begin{aligned}
E[g(X)] &= E[Y] \\
&= \sum_y y\, p_Y(y) \\
&= \sum_y y \sum_{\{x \mid g(x) = y\}} p_X(x) \\
&= \sum_y \sum_{\{x \mid g(x) = y\}} y\, p_X(x) \\
&= \sum_y \sum_{\{x \mid g(x) = y\}} g(x)\, p_X(x) \\
&= \sum_x g(x)\, p_X(x).
\end{aligned}$$

Using the expected value rule, we can write the variance of $X$ as 

$$\operatorname{var}(X) = E\big[(X - E[X])^2\big] = \sum_x (x - E[X])^2 p_X(x).$$

Similarly, the $n$th moment is given by 

$$E[X^n] = \sum_x x^n p_X(x),$$

and there is no need to calculate the PMF of $X^n$. 


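Example 2.3. (Continued) For the random variable $X$ with PMF 

$$p_X(x) = \begin{cases} 1/9 & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0 & \text{otherwise,} \end{cases}$$

we have 

$$\begin{aligned}
\operatorname{var}(X) &= E\big[(X - E[X])^2\big] \\
&= \sum_x (x - E[X])^2 p_X(x) \\
&= \frac{1}{9} \sum_{x=-4}^{4} x^2 \qquad \text{(since } E[X] = 0\text{)} \\
&= \frac{1}{9} (16 + 9 + 4 + 1 + 0 + 1 + 4 + 9 + 16) \\
&= \frac{60}{9},
\end{aligned}$$

which is consistent with the result obtained earlier. 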
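
The expected value rule is convenient computationally as well. The following Python sketch is an illustration only; the helper names `expect` and `variance` are ours. It computes the mean and variance of the random variable of Example 2.3 directly from $p_X$, without forming the PMF of $(X - E[X])^2$.

```python
from fractions import Fraction

def expect(pmf, g=lambda x: x):
    """Expected value rule: E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """var(X) = E[(X - E[X])^2], evaluated with the expected value rule."""
    mean = expect(pmf)
    return expect(pmf, lambda x: (x - mean) ** 2)

# X takes each integer value in [-4, 4] with probability 1/9.
pmf_x = {x: Fraction(1, 9) for x in range(-4, 5)}

print(expect(pmf_x))     # the mean, 0
print(variance(pmf_x))   # 60/9, printed in lowest terms as 20/3
```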


As we have noted earlier, the variance is always nonnegative, but could it 
be zero? Since every term in the formula $\sum_x (x - E[X])^2 p_X(x)$ for the variance 
is nonnegative, the sum is zero if and only if $(x - E[X])^2 p_X(x) = 0$ for every $x$. 
This condition implies that for any $x$ with $p_X(x) > 0$, we must have $x = E[X]$ 
and the random variable $X$ is not really “random”: its experimental value is 
equal to the mean $E[X]$, with probability 1. 


Variance 

The variance $\operatorname{var}(X)$ of a random variable $X$ is defined by 

$$\operatorname{var}(X) = E\big[(X - E[X])^2\big]$$

and can be calculated as 

$$\operatorname{var}(X) = \sum_x (x - E[X])^2 p_X(x).$$

It is always nonnegative. Its square root is denoted by $\sigma_X$ and is called the 
standard deviation. 


Let us now use the expected value rule for functions in order to derive some 
important properties of the mean and the variance. We start with a random 
variable X and define a new random variable Y, of the form 


$$Y = aX + b,$$

where $a$ and $b$ are given scalars. Let us derive the mean and the variance of the 
linear function $Y$. We have 

$$E[Y] = \sum_x (ax + b)\, p_X(x) = a \sum_x x\, p_X(x) + b \sum_x p_X(x) = aE[X] + b.$$

Furthermore, 

$$\begin{aligned}
\operatorname{var}(Y) &= \sum_x \big(ax + b - E[aX + b]\big)^2 p_X(x) \\
&= \sum_x \big(ax + b - aE[X] - b\big)^2 p_X(x) \\
&= a^2 \sum_x (x - E[X])^2 p_X(x) \\
&= a^2 \operatorname{var}(X).
\end{aligned}$$


Mean and Variance of a Linear Function of a Random Variable 


Let $X$ be a random variable and let 

$$Y = aX + b,$$

where $a$ and $b$ are given scalars. Then, 

$$E[Y] = aE[X] + b, \qquad \operatorname{var}(Y) = a^2 \operatorname{var}(X).$$


Let us also give a convenient formula for the variance of a random variable 
X with given PMF. 


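Variance in Terms of Moments Expression 

$$\operatorname{var}(X) = E[X^2] - (E[X])^2.$$


This expression is verified as follows: 

$$\begin{aligned}
\operatorname{var}(X) &= \sum_x (x - E[X])^2 p_X(x) \\
&= \sum_x \big(x^2 - 2xE[X] + (E[X])^2\big) p_X(x) \\
&= \sum_x x^2 p_X(x) - 2E[X] \sum_x x\, p_X(x) + (E[X])^2 \sum_x p_X(x) \\
&= E[X^2] - 2(E[X])^2 + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned}$$
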

We will now derive the mean and the variance of a few important random 
variables. 


Example 2.4. Mean and Variance of the Bernoulli. Consider the experiment 
of tossing a biased coin, which comes up a head with probability $p$ and a tail with 
probability $1 - p$, and the Bernoulli random variable $X$ with PMF 

$$p_X(k) = \begin{cases} p & \text{if } k = 1, \\ 1 - p & \text{if } k = 0. \end{cases}$$

Its mean, second moment, and variance are given by the following calculations: 

$$\begin{aligned}
E[X] &= 1 \cdot p + 0 \cdot (1 - p) = p, \\
E[X^2] &= 1^2 \cdot p + 0 \cdot (1 - p) = p, \\
\operatorname{var}(X) &= E[X^2] - (E[X])^2 = p - p^2 = p(1 - p).
\end{aligned}$$

Example 2.5. Discrete Uniform Random Variable. What are the mean and 
variance of the roll of a fair six-sided die? If we view the result of the roll as a 
random variable $X$, its PMF is 

$$p_X(k) = \begin{cases} 1/6 & \text{if } k = 1, 2, 3, 4, 5, 6, \\ 0 & \text{otherwise.} \end{cases}$$

Since the PMF is symmetric around 3.5, we conclude that $E[X] = 3.5$. Regarding 
the variance, we have 

$$\operatorname{var}(X) = E[X^2] - (E[X])^2 = \frac{1}{6} \left(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2\right) - (3.5)^2,$$

which yields $\operatorname{var}(X) = 35/12$. 

The above random variable is a special case of a discrete uniformly 
distributed random variable (or discrete uniform for short), which by definition, 
takes one out of a range of contiguous integer values, with equal probability. More 
precisely, this random variable has a PMF of the form 

$$p_X(k) = \begin{cases} \dfrac{1}{b - a + 1} & \text{if } k = a, a + 1, \ldots, b, \\ 0 & \text{otherwise,} \end{cases}$$

where $a$ and $b$ are two integers with $a < b$; see Fig. 2.9. 
The mean is 

$$E[X] = \frac{a + b}{2},$$

as can be seen by inspection, since the PMF is symmetric around $(a + b)/2$. To 
calculate the variance of $X$, we first consider the simpler case where $a = 1$ and 
$b = n$. It can be verified by induction on $n$ that 

$$E[X^2] = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{1}{6} (n + 1)(2n + 1).$$

We leave the verification of this as an exercise for the reader. The variance can now 
be obtained in terms of the first and second moments: 

$$\begin{aligned}
\operatorname{var}(X) &= E[X^2] - (E[X])^2 \\
&= \frac{1}{6} (n + 1)(2n + 1) - \frac{1}{4} (n + 1)^2 \\
&= \frac{1}{12} (n + 1)(4n + 2 - 3n - 3) \\
&= \frac{n^2 - 1}{12}.
\end{aligned}$$


Figure 2.9: PMF of the discrete random variable that is uniformly 
distributed between two integers $a$ and $b$. Its mean and variance are 

$$E[X] = \frac{a + b}{2}, \qquad \operatorname{var}(X) = \frac{(b - a)(b - a + 2)}{12}.$$


For the case of general integers $a$ and $b$, we note that the uniformly distributed 
random variable over $[a, b]$ has the same variance as the uniformly distributed 
random variable over the interval $[1, b - a + 1]$, since these two random variables differ 
by the constant $a - 1$. Therefore, the desired variance is given by the above formula 
with $n = b - a + 1$, which yields 

$$\operatorname{var}(X) = \frac{(b - a + 1)^2 - 1}{12} = \frac{(b - a)(b - a + 2)}{12}.$$
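
The closed-form mean and variance of the discrete uniform can be checked against a direct computation from the PMF. The Python sketch below is ours (the function names are illustrative), and it uses the moment formula $\operatorname{var}(X) = E[X^2] - (E[X])^2$ with exact fractions.

```python
from fractions import Fraction

def uniform_pmf(a, b):
    """PMF of the discrete uniform random variable on the integers a, a+1, ..., b."""
    return {k: Fraction(1, b - a + 1) for k in range(a, b + 1)}

def mean_and_variance(pmf):
    """Return (E[X], var(X)), with var(X) = E[X^2] - (E[X])^2."""
    mean = sum(k * p for k, p in pmf.items())
    second_moment = sum(k * k * p for k, p in pmf.items())
    return mean, second_moment - mean**2

a, b = 1, 6   # the fair six-sided die of this example
mean, var = mean_and_variance(uniform_pmf(a, b))
print(mean, var)
print(mean == Fraction(a + b, 2), var == Fraction((b - a) * (b - a + 2), 12))
```

Changing `a` and `b` to any other pair of integers with `a < b` gives a quick check of the general formulas.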


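Example 2.6. The Mean of the Poisson. The mean of the Poisson PMF 

$$p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots,$$

can be calculated as follows: 

$$\begin{aligned}
E[X] &= \sum_{k=0}^{\infty} k\, e^{-\lambda} \frac{\lambda^k}{k!} \\
&= \sum_{k=1}^{\infty} k\, e^{-\lambda} \frac{\lambda^k}{k!} && \text{(the } k = 0 \text{ term is zero)} \\
&= \lambda \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^{k-1}}{(k - 1)!} \\
&= \lambda \sum_{m=0}^{\infty} e^{-\lambda} \frac{\lambda^m}{m!} && \text{(let } m = k - 1\text{)} \\
&= \lambda.
\end{aligned}$$

The last equality is obtained by noting that $\sum_{m=0}^{\infty} e^{-\lambda} \lambda^m / m! = \sum_{m=0}^{\infty} p_X(m) = 1$ is 
the normalization property for the Poisson PMF. 

A similar calculation shows that the variance of a Poisson random variable is 
also $\lambda$ (see the solved problems). We will have the occasion to derive this fact in a 
number of different ways in later chapters. 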
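
The mean and variance of the Poisson can also be checked numerically by truncating the infinite sums. The Python sketch below is an illustration under our own assumptions: the function name is ours, the value $\lambda = 3$ is arbitrary, and 100 terms are assumed to be more than enough for the neglected tail to be negligible at this value of $\lambda$.

```python
from math import exp

def poisson_pmf_values(lam, terms=100):
    """Return [p_X(0), ..., p_X(terms - 1)], using the recursion
    p_X(k) = p_X(k - 1) * lam / k to avoid large factorials."""
    values = [exp(-lam)]
    for k in range(1, terms):
        values.append(values[-1] * lam / k)
    return values

lam = 3.0
pmf = poisson_pmf_values(lam)
total = sum(pmf)                                   # close to 1 (normalization)
mean = sum(k * p for k, p in enumerate(pmf))       # close to lam
var = sum(k * k * p for k, p in enumerate(pmf)) - mean**2   # close to lam
print(total, mean, var)
```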


Expected values often provide a convenient vehicle for choosing optimally 
between several candidate decisions that result in different expected rewards. If 
we view the expected reward of a decision as its “average payoff over a large 
number of trials,” it is reasonable to choose a decision with maximum expected 
reward. The following is an example. 


Example 2.7. The Quiz Problem. This example, when generalized appro- 
priately, is a prototypical model for optimal scheduling of a collection of tasks that 
have uncertain outcomes. 

Consider a quiz game where a person is given two questions and must decide 
which question to answer first. Question 1 will be answered correctly with proba- 
bility 0.8, and the person will then receive as prize $100, while question 2 will be 
answered correctly with probability 0.5, and the person will then receive as prize 
$200. If the first question attempted is answered incorrectly, the quiz terminates, 
i.e., the person is not allowed to attempt the second question. If the first question 
is answered correctly, the person is allowed to attempt the second question. Which 
question should be answered first to maximize the expected value of the total prize 
money received? 

The answer is not obvious because there is a tradeoff: attempting first the 
more valuable but also more difficult question 2 carries the risk of never getting a 
chance to attempt the easier question 1. Let us view the total prize money received 
as a random variable X, and calculate the expected value E[X] under the two 
possible question orders (cf. Fig. 2.10): 


Figure 2.10: Sequential description of the sample space of the quiz problem 
for the two cases where we answer question 1 or question 2 first. 

(a) Answer question 1 first: Then the PMF of $X$ is (cf. the left side of Fig. 2.10) 

$$p_X(0) = 0.2, \qquad p_X(100) = 0.8 \cdot 0.5, \qquad p_X(300) = 0.8 \cdot 0.5,$$

and we have 

$$E[X] = 0.8 \cdot 0.5 \cdot 100 + 0.8 \cdot 0.5 \cdot 300 = \$160.$$

(b) Answer question 2 first: Then the PMF of $X$ is (cf. the right side of Fig. 2.10) 

$$p_X(0) = 0.5, \qquad p_X(200) = 0.5 \cdot 0.2, \qquad p_X(300) = 0.5 \cdot 0.8,$$

and we have 

$$E[X] = 0.5 \cdot 0.2 \cdot 200 + 0.5 \cdot 0.8 \cdot 300 = \$140.$$

Thus, it is preferable to attempt the easier question 1 first. 

Let us now generalize the analysis. Denote by $p_1$ and $p_2$ the probabilities 
of correctly answering questions 1 and 2, respectively, and by $v_1$ and $v_2$ the 
corresponding prizes. If question 1 is answered first, we have 

$$E[X] = p_1 (1 - p_2) v_1 + p_1 p_2 (v_1 + v_2) = p_1 v_1 + p_1 p_2 v_2,$$

while if question 2 is answered first, we have 

$$E[X] = p_2 (1 - p_1) v_2 + p_2 p_1 (v_2 + v_1) = p_2 v_2 + p_2 p_1 v_1.$$

It is thus optimal to answer question 1 first if and only if 

$$p_1 v_1 + p_1 p_2 v_2 \geq p_2 v_2 + p_2 p_1 v_1,$$

or equivalently, if 

$$\frac{p_1 v_1}{1 - p_1} \geq \frac{p_2 v_2}{1 - p_2}.$$

Thus, it is optimal to order the questions in decreasing value of the expression 
$pv/(1 - p)$, which provides a convenient index of quality for a question with 
probability of correct answer $p$ and value $v$. Interestingly, this rule generalizes to the 
case of more than two questions (see the end-of-chapter problems). 
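
The comparison of the two question orders, and the index $pv/(1 - p)$, can be sketched in Python as follows. This is only an illustration: the function names are ours, each question is represented as a hypothetical pair `(p, v)`, and the index is assumed to be used only when $p < 1$.

```python
def expected_prize(order):
    """Expected total prize when questions are attempted in the given order.

    Each question is a pair (p, v): probability of a correct answer and prize.
    The quiz stops at the first incorrect answer.
    """
    total, reach = 0.0, 1.0   # reach: probability that all answers so far are correct
    for p, v in order:
        reach *= p
        total += reach * v
    return total

def quality_index(question):
    """The index p * v / (1 - p) derived above (assumes p < 1)."""
    p, v = question
    return p * v / (1 - p)

q1, q2 = (0.8, 100), (0.5, 200)
print(expected_prize([q1, q2]))   # question 1 first
print(expected_prize([q2, q1]))   # question 2 first

best_order = sorted([q1, q2], key=quality_index, reverse=True)
print(best_order)
```

Sorting by the index reproduces the conclusion of the example, namely that question 1 should be attempted first.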


We finally illustrate by example a common pitfall: unless g(X) is a linear 
function, it is not generally true that E[g(X)] is equal to g(E[X]). 


Example 2.8. Average Speed Versus Average Time. If the weather is good 
(which happens with probability 0.6), Alice walks the 2 miles to class at a speed of 
$V = 5$ miles per hour, and otherwise drives her motorcycle at a speed of $V = 30$ 
miles per hour. What is the mean of the time $T$ to get to class? 

The correct way to solve the problem is to first derive the PMF of $T$, 

$$p_T(t) = \begin{cases} 0.6 & \text{if } t = 2/5 \text{ hours,} \\ 0.4 & \text{if } t = 2/30 \text{ hours,} \end{cases}$$

and then calculate its mean by 

$$E[T] = 0.6 \cdot \frac{2}{5} + 0.4 \cdot \frac{2}{30} = \frac{4}{15} \text{ hours.}$$

However, it is wrong to calculate the mean of the speed $V$, 

$$E[V] = 0.6 \cdot 5 + 0.4 \cdot 30 = 15 \text{ miles per hour,}$$

and then claim that the mean of the time $T$ is 

$$\frac{2}{E[V]} = \frac{2}{15} \text{ hours.}$$

To summarize, in this example we have 

$$T = \frac{2}{V}, \qquad \text{and} \qquad E[T] = E\left[\frac{2}{V}\right] \neq \frac{2}{E[V]}.$$
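
The pitfall of this example is easy to reproduce numerically. In the Python sketch below (our own illustration, with the probabilities 0.6 and 0.4 written as exact fractions), the mean of $T = 2/V$ is computed with the expected value rule and compared with $2/E[V]$.

```python
from fractions import Fraction

distance = 2  # miles
pmf_v = {5: Fraction(6, 10), 30: Fraction(4, 10)}   # speed V in miles per hour

mean_speed = sum(v * p for v, p in pmf_v.items())                   # E[V]
mean_time = sum(Fraction(distance, v) * p for v, p in pmf_v.items())  # E[2/V]

print(mean_time)                         # E[T], in hours
print(Fraction(distance) / mean_speed)   # 2/E[V], a different number
```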


2.5 JOINT PMFS OF MULTIPLE RANDOM VARIABLES 


Probabilistic models often involve several random variables of interest. For exam- 
ple, in a medical diagnosis context, the results of several tests may be significant, 
or in a networking context, the workloads of several gateways may be of interest. 
All of these random variables are associated with the same experiment, sample 
space, and probability law, and their values may relate in interesting ways. This 
motivates us to consider probabilities involving simultaneously the numerical val- 
ues of several random variables and to investigate their mutual couplings. In this 
section, we will extend the concepts of PMF and expectation developed so far to 
multiple random variables. Later on, we will also develop notions of conditioning 
and independence that closely parallel the ideas discussed in Chapter 1. 

Consider two discrete random variables X and Y associated with the same 
experiment. The joint PMF of X and Y is defined by 


$$p_{X,Y}(x, y) = P(X = x, Y = y)$$

for all pairs of numerical values $(x, y)$ that $X$ and $Y$ can take. Here and elsewhere, 
we will use the abbreviated notation $P(X = x, Y = y)$ instead of the more precise 
notations $P(\{X = x\} \cap \{Y = y\})$ or $P(X = x \text{ and } Y = y)$. 

The joint PMF determines the probability of any event that can be specified 
in terms of the random variables $X$ and $Y$. For example, if $A$ is the set of all 
pairs $(x, y)$ that have a certain property, then 

$$P\big((X, Y) \in A\big) = \sum_{(x,y) \in A} p_{X,Y}(x, y).$$

In fact, we can calculate the PMFs of $X$ and $Y$ by using the formulas 

$$p_X(x) = \sum_y p_{X,Y}(x, y), \qquad p_Y(y) = \sum_x p_{X,Y}(x, y).$$

The formula for $p_X(x)$ can be verified using the calculation 

$$\begin{aligned}
p_X(x) &= P(X = x) \\
&= \sum_y P(X = x, Y = y) \\
&= \sum_y p_{X,Y}(x, y),
\end{aligned}$$

where the second equality follows by noting that the event $\{X = x\}$ is the union 
of the disjoint events $\{X = x, Y = y\}$ as $y$ ranges over all the different values of 
$Y$. The formula for $p_Y(y)$ is verified similarly. We sometimes refer to $p_X$ and 
$p_Y$ as the marginal PMFs, to distinguish them from the joint PMF. 

The example of Fig. 2.11 illustrates the calculation of the marginal PMFs 
from the joint PMF by using the tabular method. Here, the joint PMF of X 
and Y is arranged in a two-dimensional table, and the marginal PMF of X 
or Y at a given value is obtained by adding the table entries along a 
corresponding column or row, respectively. 


Functions of Multiple Random Variables 


When there are multiple random variables of interest, it is possible to generate 
new random variables by considering functions involving several of these random 
variables. In particular, a function Z = g(X,Y) of the random variables X and 
$Y$ defines another random variable. Its PMF can be calculated from the joint 
PMF $p_{X,Y}$ according to 

$$p_Z(z) = \sum_{\{(x,y) \mid g(x,y) = z\}} p_{X,Y}(x, y).$$

Furthermore, the expected value rule for functions naturally extends and takes 
the form 

$$E[g(X, Y)] = \sum_{x,y} g(x, y)\, p_{X,Y}(x, y).$$

The verification of this is very similar to the earlier case of a function of a single 
random variable. In the special case where $g$ is linear and of the form $aX + bY + c$, 
where $a$, $b$, and $c$ are given scalars, we have 

$$E[aX + bY + c] = aE[X] + bE[Y] + c.$$


Joint PMF $p_{X,Y}(x, y)$ in tabular form 

            x = 1   x = 2   x = 3   x = 4  |  Row sums: marginal PMF $p_Y(y)$ 
    y = 4     0     1/20    1/20    1/20   |  3/20 
    y = 3    1/20   2/20    3/20    1/20   |  7/20 
    y = 2    1/20   2/20    3/20    1/20   |  7/20 
    y = 1    1/20   1/20    1/20     0     |  3/20 
    ----------------------------------------
             3/20   6/20    8/20    3/20      Column sums: marginal PMF $p_X(x)$ 


Figure 2.11: Illustration of the tabular method for calculating marginal PMFs 
from joint PMFs. The joint PMF is represented by a table, where the number in 
each square $(x, y)$ gives the value of $p_{X,Y}(x, y)$. To calculate the marginal PMF 
$p_X(x)$ for a given value of $x$, we add the numbers in the column corresponding to 
$x$. For example $p_X(2) = 6/20$. Similarly, to calculate the marginal PMF $p_Y(y)$ 
for a given value of $y$, we add the numbers in the row corresponding to $y$. For 
example $p_Y(2) = 7/20$. 
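
The tabular method amounts to summing a dictionary of joint probabilities along one coordinate. The Python sketch below is an illustration only: the helper `marginal` is our own, the table entries are those of Fig. 2.11 in units of 1/20, and we assume that the columns of the table correspond to $x = 1, 2, 3, 4$ from left to right.

```python
from fractions import Fraction

# Joint PMF of Fig. 2.11 in units of 1/20. Rows are y = 4, 3, 2, 1; the columns
# are assumed to correspond to x = 1, 2, 3, 4 (left to right).
table = {4: [0, 1, 1, 1], 3: [1, 2, 3, 1], 2: [1, 2, 3, 1], 1: [1, 1, 1, 0]}
joint = {(x, y): Fraction(n, 20)
         for y, row in table.items()
         for x, n in enumerate(row, start=1)}

def marginal(joint, axis):
    """Sum the joint PMF over the other coordinate (axis 0 gives p_X, axis 1 gives p_Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out

p_x = marginal(joint, 0)   # column sums
p_y = marginal(joint, 1)   # row sums
print(p_x)
print(p_y)
```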


More than Two Random Variables 


The joint PMF of three random variables X, Y, and Z is defined in analogy with 
the above as 


$$p_{X,Y,Z}(x, y, z) = P(X = x, Y = y, Z = z),$$

for all possible triplets of numerical values $(x, y, z)$. Corresponding marginal 
PMFs are analogously obtained by equations such as 

$$p_{X,Y}(x, y) = \sum_z p_{X,Y,Z}(x, y, z),$$

and 

$$p_X(x) = \sum_y \sum_z p_{X,Y,Z}(x, y, z).$$

The expected value rule for functions takes the form 

$$E[g(X, Y, Z)] = \sum_{x,y,z} g(x, y, z)\, p_{X,Y,Z}(x, y, z),$$

and if $g$ is linear and of the form $aX + bY + cZ + d$, then 

$$E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d.$$

Furthermore, there are obvious generalizations of the above to more than three 
random variables. For example, for any random variables $X_1, X_2, \ldots, X_n$ and 
any scalars $a_1, a_2, \ldots, a_n$, we have 

$$E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_n E[X_n].$$


Example 2.9. Mean of the Binomial. Your probability class has 300 students 
and each student has probability 1/3 of getting an A, independently of any other 
student. What is the mean of $X$, the number of students that get an A? Let 

$$X_i = \begin{cases} 1 & \text{if the } i\text{th student gets an A,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus $X_1, X_2, \ldots, X_n$, with $n = 300$, are Bernoulli random variables with common mean $p = 1/3$ 
and variance $p(1 - p) = (1/3)(2/3) = 2/9$. Their sum 

$$X = X_1 + X_2 + \cdots + X_n$$

is the number of students that get an A. Since $X$ is the number of “successes” in 
$n$ independent trials, it is a binomial random variable with parameters $n$ and $p$. 
Using the linearity of $X$ as a function of the $X_i$, we have 

$$E[X] = \sum_{i=1}^{300} E[X_i] = \sum_{i=1}^{300} \frac{1}{3} = 300 \cdot \frac{1}{3} = 100.$$

If we repeat this calculation for a general number of students $n$ and probability of 
A equal to $p$, we obtain 

$$E[X] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np.$$


Example 2.10. The Hat Problem. Suppose that n people throw their hats in 
a box and then each picks up one hat at random. What is the expected value of 
X, the number of people that get back their own hat? 

For the ith person, we introduce a random variable $X_i$ that takes the value 
1 if the person selects his/her own hat, and takes the value 0 otherwise. Since 
$P(X_i = 1) = 1/n$ and $P(X_i = 0) = 1 - 1/n$, the mean of $X_i$ is 

$$E[X_i] = 1 \cdot \frac{1}{n} + 0 \cdot \left(1 - \frac{1}{n}\right) = \frac{1}{n}.$$

We now have 

$$X = X_1 + X_2 + \cdots + X_n,$$

so that 

$$E[X] = E[X_1] + E[X_2] + \cdots + E[X_n] = n \cdot \frac{1}{n} = 1.$$
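The hat problem can be checked by brute force for small n. The following Python sketch assumes that every assignment of hats to people is equally likely, enumerates all n! assignments, and averages the number of people who get their own hat back. The function name expected_fixed_points is our own, and the enumeration is only practical for small n.

```python
from fractions import Fraction
from itertools import permutations


def expected_fixed_points(n):
    """Exact E[X] for the hat problem, by enumerating all n! hat assignments."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        # perm[i] is the hat picked by person i; a match means perm[i] == i.
        total += sum(1 for i, hat in enumerate(perm) if hat == i)
        count += 1
    return Fraction(total, count)


for n in range(1, 7):
    print(n, expected_fixed_points(n))
```

The average equals 1 for every n tried, even though the indicator variables of different people are not independent. Linearity of expectation does not require independence.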


Summary of Facts About Joint PMFs 
Let X and Y be random variables associated with the same experiment. 


• The joint PMF of X and Y is defined by 

$$p_{X,Y}(x,y) = P(X = x,\, Y = y).$$

• The marginal PMFs of X and Y can be obtained from the joint PMF, 
using the formulas 

$$p_X(x) = \sum_{y} p_{X,Y}(x,y), \qquad p_Y(y) = \sum_{x} p_{X,Y}(x,y).$$

• A function g(X,Y) of X and Y defines another random variable, and 

$$E\big[g(X,Y)\big] = \sum_{x,y} g(x,y)\, p_{X,Y}(x,y).$$

If g is linear, of the form aX + bY + c, we have 

$$E[aX + bY + c] = aE[X] + bE[Y] + c.$$

• The above have natural extensions to the case where more than two 
random variables are involved. 


2.6 CONDITIONING 


If we have a probabilistic model and we are also told that a certain event A has 
occurred, we can capture this knowledge by employing the conditional instead of 
the original (unconditional) probabilities. As discussed in Chapter 1, conditional 
probabilities are like ordinary probabilities (they satisfy the three axioms) except that 
they refer to a new universe in which event A is known to have occurred. In the 
same spirit, we can talk about conditional PMFs, which provide the probabilities 
of the possible values of a random variable, conditioned on the occurrence of 
some event. This idea is developed in this section. In reality though, there is 
not much that is new, only an elaboration of concepts that are familiar from 
Chapter 1, together with a fair dose of new notation. 


Conditioning a Random Variable on an Event 


The conditional PMF of a random variable X, conditioned on a particular 
event A with P(A) > 0, is defined by 

$$p_{X|A}(x) = P(X = x \mid A) = \frac{P(\{X = x\} \cap A)}{P(A)}.$$

Note that the events {X = x} ∩ A are disjoint for different values of x, their 
union is A, and, therefore, 

$$P(A) = \sum_{x} P(\{X = x\} \cap A).$$

Combining the above two formulas, we see that 

$$\sum_{x} p_{X|A}(x) = 1,$$

so $p_{X|A}$ is a legitimate PMF. 
As an example, let X be the roll of a die and let A be the event that the 
roll is an even number. Then, by applying the preceding formula, we obtain 

$$p_{X|A}(x) = P(X = x \mid \text{roll is even}) = \frac{P(X = x \text{ and } X \text{ is even})}{P(\text{roll is even})} = \begin{cases} 1/3 & \text{if } x = 2, 4, 6, \\ 0 & \text{otherwise.} \end{cases}$$

The conditional PMF is calculated similarly to its unconditional counterpart: 
to obtain $p_{X|A}(x)$, we add the probabilities of the outcomes that give rise to 
X = x and belong to the conditioning event A, and then normalize by dividing 
by P(A) (see Fig. 2.12). 
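The add-and-normalize recipe can be written as a short Python sketch. It assumes a finite sample space given as a dictionary of outcome probabilities. The helper conditional_pmf and its argument names are our own choices for illustration, and values of x that do not appear in the result have conditional probability 0.

```python
from fractions import Fraction


def conditional_pmf(outcome_probs, value_of, in_event):
    """Return p_{X|A} as a dict, given outcome probabilities, X, and event A."""
    p_a = sum(p for w, p in outcome_probs.items() if in_event(w))
    if p_a == 0:
        raise ValueError("conditioning event must have positive probability")
    pmf = {}
    for w, p in outcome_probs.items():
        if in_event(w):
            x = value_of(w)
            # Add up P({X = x} and A), normalized by P(A).
            pmf[x] = pmf.get(x, 0) + p / p_a
    return pmf


# Fair die: X is the roll, A is the event that the roll is even.
die = {face: Fraction(1, 6) for face in range(1, 7)}
p_x_given_even = conditional_pmf(die, lambda w: w, lambda w: w % 2 == 0)
print(p_x_given_even)
```

For the die, the result assigns probability 1/3 to each of 2, 4, and 6, in agreement with the calculation in the text.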


Figure 2.12: Visualization and calculation of the conditional PMF $p_{X|A}(x)$. For 
each x, we add the probabilities of the outcomes in the intersection {X = x} ∩ A 
and normalize by dividing by P(A). 
Conditioning one Random Variable on Another 


Let X and Y be two random variables associated with the same experiment. If 
we know that the experimental value of Y is some particular y (with $p_Y(y) > 0$), 
this provides partial knowledge about the value of X. This knowledge is captured 
by the conditional PMF $p_{X|Y}$ of X given Y, which is defined by specializing 
the definition of $p_{X|A}$ to events A of the form {Y = y}: 

$$p_{X|Y}(x \mid y) = P(X = x \mid Y = y).$$

Using the definition of conditional probabilities, we have 

$$p_{X|Y}(x \mid y) = \frac{P(X = x,\, Y = y)}{P(Y = y)} = \frac{p_{X,Y}(x,y)}{p_Y(y)}.$$

Let us fix some y with $p_Y(y) > 0$, and consider $p_{X|Y}(x \mid y)$ as a function 
of x. This function is a valid PMF for X: it assigns nonnegative values to each 
possible x, and these values add to 1. Furthermore, this function of x has the 
same shape as $p_{X,Y}(x,y)$ except that it is normalized by dividing by $p_Y(y)$, 
which enforces the normalization property 

$$\sum_{x} p_{X|Y}(x \mid y) = 1.$$


Figure 2.13 provides a visualization of the conditional PMF. 


Figure 2.13: Visualization of the conditional PMF $p_{X|Y}(x \mid y)$. For each y, we 
view the joint PMF along the slice Y = y and renormalize so that 

$$\sum_{x} p_{X|Y}(x \mid y) = 1.$$
The conditional PMF is often convenient for the calculation of the joint 
PMF, using a sequential approach and the formula 

$$p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y),$$

or its counterpart 

$$p_{X,Y}(x,y) = p_X(x)\, p_{Y|X}(y \mid x).$$

This method is entirely similar to the use of the multiplication rule from 
Chapter 1. The following examples provide an illustration. 


Example 2.11. Professor May B. Right often has her facts wrong, and answers 
each of her students’ questions incorrectly with probability 1/4, independently of 
other questions. In each lecture May is asked 0, 1, or 2 questions with equal 
probability 1/3. Let X and Y be the number of questions May is asked and the number of 
questions she answers wrong in a given lecture, respectively. To construct the joint 
PMF $p_{X,Y}(x,y)$, we need to calculate the probabilities P(X = x, Y = y) for all 
combinations of values of x and y. This can be done by using a sequential 
description of the experiment and the multiplication rule $p_{X,Y}(x,y) = p_X(x)\, p_{Y|X}(y \mid x)$, 
as shown in Fig. 2.14. For example, for the case where one question is asked and is 
answered wrong, we have 

$$p_{X,Y}(1,1) = p_X(1)\, p_{Y|X}(1 \mid 1) = \frac{1}{3} \cdot \frac{1}{4} = \frac{1}{12}.$$

The joint PMF can be represented by a two-dimensional table, as shown in Fig. 
2.14. It can be used to calculate the probability of any event of interest. For 
instance, we have 

$$P(\text{at least one wrong answer}) = p_{X,Y}(1,1) + p_{X,Y}(2,1) + p_{X,Y}(2,2) = \frac{4}{48} + \frac{6}{48} + \frac{1}{48} = \frac{11}{48}.$$
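The sequential construction of Example 2.11 can be sketched in Python as follows. Given that x questions are asked, we model the number of wrong answers as binomial with parameters x and 1/4, which follows from the independence assumption in the example. The names P_WRONG, p_y_given_x, and joint are our own.

```python
from fractions import Fraction
from math import comb

P_WRONG = Fraction(1, 4)                       # chance a given answer is wrong
p_x = {x: Fraction(1, 3) for x in (0, 1, 2)}   # number of questions asked


def p_y_given_x(y, x):
    """Binomial(x, 1/4): probability of y wrong answers out of x questions."""
    if not 0 <= y <= x:
        return Fraction(0)
    return comb(x, y) * P_WRONG**y * (1 - P_WRONG) ** (x - y)


# Multiplication rule: p_{X,Y}(x, y) = p_X(x) * p_{Y|X}(y | x).
joint = {(x, y): p_x[x] * p_y_given_x(y, x) for x in p_x for y in range(3)}

at_least_one_wrong = sum(p for (x, y), p in joint.items() if y >= 1)
print(joint[(1, 1)], at_least_one_wrong)  # 1/12 11/48
```

Exact fractions are used so that the entries can be compared directly with the table entries, which are expressed in units of 1/48.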


Figure 2.14: Calculation of the joint PMF $p_{X,Y}(x,y)$ in Example 2.11, with X the 
number of questions asked and Y the number of questions answered wrong. The joint 
PMF in tabular form is: 

          x = 0    x = 1    x = 2 
 y = 2      0        0      1/48 
 y = 1      0      4/48     6/48 
 y = 0    16/48    12/48    9/48 


Example 2.12. Consider four independent rolls of a 6-sided die. Let X be the 
number of 1’s and let Y be the number of 2’s obtained. What is the joint PMF of 
X and Y? 

The marginal PMF $p_Y$ is given by the binomial formula 

$$p_Y(y) = \binom{4}{y} \left(\frac{1}{6}\right)^{y} \left(\frac{5}{6}\right)^{4-y}, \qquad y = 0, 1, \ldots, 4.$$

To compute the conditional PMF $p_{X|Y}$, note that given that Y = y, X is the 
number of 1’s in the remaining 4 − y rolls, each of which can take the 5 values 
1, 3, 4, 5, 6 with equal probability 1/5. Thus, the conditional PMF $p_{X|Y}$ is binomial 
with parameters 4 − y and p = 1/5: 

$$p_{X|Y}(x \mid y) = \binom{4-y}{x} \left(\frac{1}{5}\right)^{x} \left(\frac{4}{5}\right)^{4-y-x},$$

for all x and y such that x, y = 0, 1, ..., 4, and 0 ≤ x + y ≤ 4. The joint PMF is 
now given by 

$$p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y) = \binom{4}{y} \left(\frac{1}{6}\right)^{y} \left(\frac{5}{6}\right)^{4-y} \binom{4-y}{x} \left(\frac{1}{5}\right)^{x} \left(\frac{4}{5}\right)^{4-y-x},$$

for all nonnegative integers x and y such that 0 ≤ x + y ≤ 4. For other values of x 
and y, we have $p_{X,Y}(x,y) = 0$. 
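As a sanity check on Example 2.12, the following Python sketch compares the product of the binomial marginal of Y and the binomial conditional of X given Y with a direct count over all 6^4 equally likely roll sequences of a fair die. The function names are our own.

```python
from fractions import Fraction
from itertools import product
from math import comb


def joint_formula(x, y):
    """p_{X,Y}(x, y) = p_Y(y) * p_{X|Y}(x | y) for four rolls of a fair die."""
    if x < 0 or y < 0 or x + y > 4:
        return Fraction(0)
    p_y = comb(4, y) * Fraction(1, 6) ** y * Fraction(5, 6) ** (4 - y)
    p_x_given_y = comb(4 - y, x) * Fraction(1, 5) ** x * Fraction(4, 5) ** (4 - y - x)
    return p_y * p_x_given_y


def joint_by_enumeration():
    """Count 1's and 2's over all 6**4 equally likely roll sequences."""
    counts = {}
    for rolls in product(range(1, 7), repeat=4):
        key = (rolls.count(1), rolls.count(2))
        counts[key] = counts.get(key, 0) + 1
    return {key: Fraction(c, 6**4) for key, c in counts.items()}


brute = joint_by_enumeration()
agree = all(joint_formula(x, y) == p for (x, y), p in brute.items())
print(agree, sum(brute.values()))
```

The two computations agree on every pair (x, y) with x + y ≤ 4, and the joint PMF sums to 1.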


The conditional PMF can also be used to calculate the marginal PMFs. In 
particular, we have, by using the definitions, 

$$p_X(x) = \sum_{y} p_{X,Y}(x,y) = \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y).$$

This formula provides a divide-and-conquer method for calculating marginal 
PMFs. It is in essence identical to the total probability theorem given in 
Chapter 1, but cast in different notation. The following example provides an 
illustration. 
Example 2.13. Consider a transmitter that is sending messages over a computer 
network. Let us define the following two random variables: 

X : the travel time of a given message, Y : the length of the given message. 

We know the PMF of the travel time of a message that has a given length, and we 
know the PMF of the message length. We want to find the (unconditional) PMF 
of the travel time of a message. 

We assume that the length of a message can take two possible values: $y = 10^2$ 
bytes with probability 5/6, and $y = 10^4$ bytes with probability 1/6, so that 

$$p_Y(y) = \begin{cases} 5/6 & \text{if } y = 10^2, \\ 1/6 & \text{if } y = 10^4. \end{cases}$$

We assume that the travel time X of the message depends on its length Y and 
the congestion level of the network at the time of transmission. In particular, the 
travel time is $10^{-4}Y$ secs with probability 1/2, $10^{-3}Y$ secs with probability 1/3, 
and $10^{-2}Y$ secs with probability 1/6. Thus, we have 

$$p_{X|Y}(x \mid 10^2) = \begin{cases} 1/2 & \text{if } x = 10^{-2}, \\ 1/3 & \text{if } x = 10^{-1}, \\ 1/6 & \text{if } x = 1, \end{cases} \qquad p_{X|Y}(x \mid 10^4) = \begin{cases} 1/2 & \text{if } x = 1, \\ 1/3 & \text{if } x = 10, \\ 1/6 & \text{if } x = 100. \end{cases}$$

To find the PMF of X, we use the total probability formula 

$$p_X(x) = \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y).$$

We obtain 

$$p_X(10^{-2}) = \frac{5}{6} \cdot \frac{1}{2}, \qquad p_X(10^{-1}) = \frac{5}{6} \cdot \frac{1}{3}, \qquad p_X(1) = \frac{5}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{2},$$

$$p_X(10) = \frac{1}{6} \cdot \frac{1}{3}, \qquad p_X(100) = \frac{1}{6} \cdot \frac{1}{6}.$$
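The total probability computation of Example 2.13 can be sketched in Python. The dictionaries p_y and multipliers encode the data of the example, and the variable names are our own. Travel times are kept as exact fractions so that the two ways of obtaining a travel time of 1 second are merged into a single entry.

```python
from fractions import Fraction

# Message length PMF p_Y, from Example 2.13.
p_y = {10**2: Fraction(5, 6), 10**4: Fraction(1, 6)}

# Travel time is (multiplier * Y) seconds, with the listed probabilities.
multipliers = {
    Fraction(1, 10**4): Fraction(1, 2),
    Fraction(1, 10**3): Fraction(1, 3),
    Fraction(1, 10**2): Fraction(1, 6),
}

# Total probability formula: p_X(x) = sum over y of p_Y(y) * p_{X|Y}(x | y).
p_x = {}
for y, prob_y in p_y.items():
    for m, prob_m in multipliers.items():
        x = m * y
        p_x[x] = p_x.get(x, 0) + prob_y * prob_m

for x in sorted(p_x):
    print(x, p_x[x])
```

The value x = 1 receives contributions from both message lengths, (5/6)(1/6) + (1/6)(1/2) = 2/9, which is the divide-and-conquer step in action.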


Note finally that one can define conditional PMFs involving more than 
two random variables, as in $p_{X,Y|Z}(x, y \mid z)$ or $p_{X|Y,Z}(x \mid y, z)$. The concepts and 
methods described above generalize easily (see the end-of-chapter problems). 
Summary of Facts About Conditional PMFs 
Let X and Y be random variables associated with the same experiment. 

• Conditional PMFs are similar to ordinary PMFs, but refer to a 
universe where the conditioning event is known to have occurred. 

• The conditional PMF of X given an event A with P(A) > 0 is defined 
by 

$$p_{X|A}(x) = P(X = x \mid A)$$

and satisfies 

$$\sum_{x} p_{X|A}(x) = 1.$$

• The conditional PMF of X given Y = y is related to the joint PMF 
by 

$$p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y).$$

This is analogous to the multiplication rule for calculating probabilities 
and can be used to calculate the joint PMF from the conditional PMF. 

• The conditional PMF of X given Y can be used to calculate the 
marginal PMFs with the formula 

$$p_X(x) = \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y).$$

This is analogous to the divide-and-conquer approach for calculating 
probabilities using the total probability theorem. 

• There are natural extensions to the above involving more than two 
random variables. 


Conditional Expectation 


A conditional PMF can be thought of as an ordinary PMF over a new uni- 
verse determined by the conditioning event. In the same spirit, a conditional 
expectation is the same as an ordinary expectation, except that it refers to the 
new universe, and all probabilities and PMFs are replaced by their conditional 
counterparts. We list the main definitions and relevant facts below. 
Summary of Facts About Conditional Expectations 
Let X and Y be random variables associated with the same experiment. 

• The conditional expectation of X given an event A with P(A) > 0 is 
defined by 

$$E[X \mid A] = \sum_{x} x\, p_{X|A}(x).$$

For a function g(X), it is given by 

$$E\big[g(X) \mid A\big] = \sum_{x} g(x)\, p_{X|A}(x).$$

• The conditional expectation of X given a value y of Y is defined by 

$$E[X \mid Y = y] = \sum_{x} x\, p_{X|Y}(x \mid y).$$

• We have 

$$E[X] = \sum_{y} p_Y(y)\, E[X \mid Y = y].$$

This is the total expectation theorem. 

• Let $A_1, \ldots, A_n$ be disjoint events that form a partition of the sample 
space, and assume that $P(A_i) > 0$ for all i. Then, 

$$E[X] = \sum_{i=1}^{n} P(A_i)\, E[X \mid A_i].$$


Let us verify the total expectation theorem, which basically says that “the 
unconditional average can be obtained by averaging the conditional averages.” 
The theorem is derived using the total probability formula 


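$$p_X(x) = \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y)$$

and the calculation 

$$\begin{aligned}
E[X] &= \sum_{x} x\, p_X(x) \\
&= \sum_{x} x \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y) \\
&= \sum_{y} p_Y(y) \sum_{x} x\, p_{X|Y}(x \mid y) \\
&= \sum_{y} p_Y(y)\, E[X \mid Y = y].
\end{aligned}$$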


The relation $E[X] = \sum_{i=1}^{n} P(A_i)\, E[X \mid A_i]$ can be verified by viewing it as 
a special case of the total expectation theorem. Let us introduce the random 
variable Y that takes the value i if and only if the event $A_i$ occurs. Its PMF is 
given by 

$$p_Y(i) = \begin{cases} P(A_i) & \text{if } i = 1, 2, \ldots, n, \\ 0 & \text{otherwise.} \end{cases}$$

The total expectation theorem yields 

$$E[X] = \sum_{i=1}^{n} P(A_i)\, E[X \mid Y = i],$$

and since the event {Y = i} is just $A_i$, we obtain the desired expression 

$$E[X] = \sum_{i=1}^{n} P(A_i)\, E[X \mid A_i].$$


The total expectation theorem is analogous to the total probability theo- 
rem. It can be used to calculate the unconditional expectation E[X] from the 
conditional PMF or expectation, using a divide-and-conquer approach. 


Example 2.14. Messages transmitted by a computer in Boston through a data 
network are destined for New York with probability 0.5, for Chicago with probability 
0.3, and for San Francisco with probability 0.2. The transit time X of a message is 
random. Its mean is 0.05 secs if it is destined for New York, 0.1 secs if it is destined 
for Chicago, and 0.3 secs if it is destined for San Francisco. Then, E[X] is easily 
calculated using the total expectation theorem as 


$$E[X] = 0.5 \cdot 0.05 + 0.3 \cdot 0.1 + 0.2 \cdot 0.3 = 0.115 \text{ secs}.$$
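The computation in Example 2.14 is a weighted average of conditional means. A minimal Python sketch, assuming the events form a partition, is given below. The helper total_expectation is our own name and simply evaluates the sum in the total expectation theorem.

```python
from fractions import Fraction


def total_expectation(cases):
    """E[X] = sum_i P(A_i) * E[X | A_i], for (probability, conditional mean) pairs."""
    if sum(p for p, _ in cases) != 1:
        raise ValueError("the events must form a partition (probabilities sum to 1)")
    return sum(p * mean for p, mean in cases)


# (P(destination), mean transit time in seconds) from Example 2.14.
cases = [
    (Fraction("0.5"), Fraction("0.05")),  # New York
    (Fraction("0.3"), Fraction("0.1")),   # Chicago
    (Fraction("0.2"), Fraction("0.3")),   # San Francisco
]
mean_transit = total_expectation(cases)
print(mean_transit, float(mean_transit))  # 23/200 0.115
```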


Example 2.15. Mean and Variance of the Geometric Random Variable. 
You write a software program over and over, and each time there is probability p 
that it works correctly, independently from previous attempts. What are the mean 
and variance of X, the number of tries until the program works correctly? 
We recognize X as a geometric random variable with PMF 

$$p_X(k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots.$$

The mean and variance of X are given by 

$$E[X] = \sum_{k=1}^{\infty} k (1-p)^{k-1} p, \qquad \mathrm{var}(X) = \sum_{k=1}^{\infty} \big(k - E[X]\big)^2 (1-p)^{k-1} p,$$

but evaluating these infinite sums is somewhat tedious. As an alternative, we will 
apply the total expectation theorem, with $A_1$ = {X = 1} = {first try is a success}, 
$A_2$ = {X > 1} = {first try is a failure}, and end up with a much simpler 
calculation. 

If the first try is successful, we have X = 1, and 

$$E[X \mid X = 1] = 1.$$

If the first try fails (X > 1), we have wasted one try, and we are back where we 
started. So, the expected number of remaining tries is E[X], and 

$$E[X \mid X > 1] = 1 + E[X].$$

Thus, 

$$E[X] = P(X = 1)\, E[X \mid X = 1] + P(X > 1)\, E[X \mid X > 1] = p + (1-p)\big(1 + E[X]\big),$$

from which we obtain 

$$E[X] = \frac{1}{p}.$$

With similar reasoning, we also have 

$$E[X^2 \mid X = 1] = 1, \qquad E[X^2 \mid X > 1] = E\big[(1 + X)^2\big] = 1 + 2E[X] + E[X^2],$$

so that 

$$E[X^2] = p \cdot 1 + (1-p)\big(1 + 2E[X] + E[X^2]\big),$$

from which we obtain 

$$E[X^2] = \frac{1 + 2(1-p)E[X]}{p},$$

and 

$$E[X^2] = \frac{2}{p^2} - \frac{1}{p}.$$

We conclude that 

$$\mathrm{var}(X) = E[X^2] - \big(E[X]\big)^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1-p}{p^2}.$$
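The closed forms E[X] = 1/p and var(X) = (1 − p)/p² for the geometric random variable can be compared with the "tedious" infinite sums by truncating them numerically. The Python sketch below does this in floating point. The function name geometric_moments, the truncation length, and the value p = 0.25 are our own illustrative choices.

```python
def geometric_moments(p, terms=10_000):
    """Approximate E[X] and var(X) for a geometric PMF by truncating the sums."""
    mean = 0.0
    second_moment = 0.0
    tail = 1.0  # (1 - p)**(k - 1), updated incrementally
    for k in range(1, terms + 1):
        prob = tail * p
        mean += k * prob
        second_moment += k * k * prob
        tail *= 1.0 - p
    return mean, second_moment - mean**2


p = 0.25
mean, variance = geometric_moments(p)
print(mean, 1 / p)                 # both close to 4
print(variance, (1 - p) / p**2)    # both close to 12
```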


2.7 INDEPENDENCE 


We now discuss concepts of independence related to random variables. These 
concepts are analogous to the concepts of independence between events (cf. Chap- 
ter 1). They are developed by simply introducing suitable events involving the 
possible values of various random variables, and by considering their indepen- 
dence. 


Independence of a Random Variable from an Event 


The independence of a random variable from an event is similar to the indepen- 
dence of two events. The idea is that knowing the occurrence of the conditioning 
event tells us nothing about the value of the random variable. More formally, 
we say that the random variable X is independent of the event A if 


$$P(X = x \text{ and } A) = P(X = x)\, P(A) = p_X(x)\, P(A), \qquad \text{for all } x,$$

which is the same as requiring that the two events {X = x} and A be 
independent, for any choice x. As long as P(A) > 0, and using the definition 
$p_{X|A}(x) = P(X = x \text{ and } A)/P(A)$ of the conditional PMF, we see that 
independence is the same as the condition 

$$p_{X|A}(x) = p_X(x), \qquad \text{for all } x.$$


Example 2.16. Consider two independent tosses of a fair coin. Let X be the 
number of heads and let A be the event that the number of heads is even. The 
(unconditional) PMF of X is 

$$p_X(x) = \begin{cases} 1/4 & \text{if } x = 0, \\ 1/2 & \text{if } x = 1, \\ 1/4 & \text{if } x = 2, \end{cases}$$

and P(A) = 1/2. The conditional PMF is obtained from the definition 
$p_{X|A}(x) = P(X = x \text{ and } A)/P(A)$: 

$$p_{X|A}(x) = \begin{cases} 1/2 & \text{if } x = 0, \\ 0 & \text{if } x = 1, \\ 1/2 & \text{if } x = 2. \end{cases}$$

Clearly, X and A are not independent, since the PMFs $p_X$ and $p_{X|A}$ are different. 
For an example of a random variable that is independent of A, consider the random 
variable that takes the value 0 if the first toss is a head, and the value 1 if the first 
toss is a tail. This is intuitively clear and can also be verified by using the definition 
of independence. 
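Both claims of Example 2.16 can be verified directly from the definition, by checking P(X = x and A) = P(X = x)P(A) for every x. The following Python sketch does so over the four equally likely outcomes of two fair tosses. The helper names are our own.

```python
from fractions import Fraction
from itertools import product

# Two independent tosses of a fair coin: four equally likely outcomes.
outcomes = {w: Fraction(1, 4) for w in product("HT", repeat=2)}


def is_independent_of_event(value_of, in_event):
    """Check P(X = x and A) == P(X = x) * P(A) for every value x of X."""
    p_a = sum(p for w, p in outcomes.items() if in_event(w))
    for x in {value_of(w) for w in outcomes}:
        p_x = sum(p for w, p in outcomes.items() if value_of(w) == x)
        p_x_and_a = sum(
            p for w, p in outcomes.items() if value_of(w) == x and in_event(w)
        )
        if p_x_and_a != p_x * p_a:
            return False
    return True


def even_heads(w):
    return w.count("H") % 2 == 0


num_heads_indep = is_independent_of_event(lambda w: w.count("H"), even_heads)
first_toss_indep = is_independent_of_event(lambda w: 0 if w[0] == "H" else 1, even_heads)
print(num_heads_indep, first_toss_indep)  # False True
```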
The notion of independence of two random variables is similar. We say that two 
random variables X and Y are independent if 

$$p_{X,Y}(x,y) = p_X(x)\, p_Y(y), \qquad \text{for all } x, y.$$

This is the same as requiring that the two events {X = x} and {Y = y} be 
independent for every x and y. Finally, the formula $p_{X,Y}(x,y) = p_{X|Y}(x \mid y)\, p_Y(y)$ 
shows that independence is equivalent to the condition 

$$p_{X|Y}(x \mid y) = p_X(x), \qquad \text{for all } y \text{ with } p_Y(y) > 0 \text{ and all } x.$$

Intuitively, independence means that the experimental value of Y tells us nothing 
about the value of X. 

There is a similar notion of conditional independence of two random 
variables, given an event A with P(A) > 0. The conditioning event A defines a new 
universe and all probabilities (or PMFs) have to be replaced by their conditional 
counterparts. For example, X and Y are said to be conditionally 
independent, given a positive probability event A, if 

$$P(X = x,\, Y = y \mid A) = P(X = x \mid A)\, P(Y = y \mid A), \qquad \text{for all } x \text{ and } y,$$

or, in this chapter’s notation, 

$$p_{X,Y|A}(x,y) = p_{X|A}(x)\, p_{Y|A}(y), \qquad \text{for all } x \text{ and } y.$$

Once more, this is equivalent to 

$$p_{X|Y,A}(x \mid y) = p_{X|A}(x) \qquad \text{for all } x \text{ and } y \text{ such that } p_{Y|A}(y) > 0.$$

As in the case of events (Section 1.4), conditional independence may not imply 
unconditional independence and vice versa. This is illustrated by the example 
in Fig. 2.15. 

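If X and Y are independent random variables, then 

$$E[XY] = E[X]\, E[Y],$$

as shown by the following calculation: 

$$\begin{aligned}
E[XY] &= \sum_{x} \sum_{y} x y\, p_{X,Y}(x,y) \\
&= \sum_{x} \sum_{y} x y\, p_X(x)\, p_Y(y) \qquad \text{(by independence)} \\
&= \sum_{x} x\, p_X(x) \sum_{y} y\, p_Y(y) \\
&= E[X]\, E[Y].
\end{aligned}$$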
Figure 2.15: Example illustrating that conditional independence may not imply 
unconditional independence. For the PMF shown, the random variables X and 
Y are not independent. For example, we have 

$$p_{X|Y}(1 \mid 1) = P(X = 1 \mid Y = 1) = 0 \neq P(X = 1) = p_X(1).$$

On the other hand, conditional on the event A = {X ≤ 2, Y ≥ 3} (the shaded 
set in the figure), the random variables X and Y can be seen to be independent. 
In particular, we have 

$$p_{X|Y,A}(x \mid y) = \begin{cases} 1/3 & \text{if } x = 1, \\ 2/3 & \text{if } x = 2, \end{cases}$$

for both values y = 3 and y = 4. 


A very similar calculation also shows that if X and Y are independent, then 

$$E\big[g(X)\, h(Y)\big] = E\big[g(X)\big]\, E\big[h(Y)\big],$$

for any functions g and h. In fact, this follows immediately once we realize that 
if X and Y are independent, then the same is true for g(X) and h(Y). This is 
intuitively clear and its formal verification is left as an end-of-chapter problem. 

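Consider now the sum Z = X + Y of two independent random variables 
X and Y, and let us calculate the variance of Z. We have, using the relation 
E[X + Y] = E[X] + E[Y], 

$$\begin{aligned}
\mathrm{var}(Z) &= E\Big[\big(X + Y - E[X + Y]\big)^2\Big] \\
&= E\Big[\big((X - E[X]) + (Y - E[Y])\big)^2\Big] \\
&= E\big[(X - E[X])^2\big] + E\big[(Y - E[Y])^2\big] + 2E\big[(X - E[X])(Y - E[Y])\big] \\
&= E\big[(X - E[X])^2\big] + E\big[(Y - E[Y])^2\big].
\end{aligned}$$

To justify the last equality, note that the random variables X − E[X] and Y − E[Y] 
are independent (they are functions of the independent random variables X and 
Y, respectively) and 

$$E\big[(X - E[X])(Y - E[Y])\big] = E\big[X - E[X]\big]\, E\big[Y - E[Y]\big] = 0.$$

We conclude that 

$$\mathrm{var}(Z) = \mathrm{var}(X) + \mathrm{var}(Y).$$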


Thus, the variance of the sum of two independent random variables is equal to 
the sum of their variances. As an interesting contrast, note that the mean of the 
sum of two random variables is always equal to the sum of their means, even if 
they are not independent. 


Summary of Facts About Independent Random Variables 


Let A be an event, with P(A) > 0, and let X and Y be random variables 
associated with the same experiment. 


• X is independent of the event A if 

$$p_{X|A}(x) = p_X(x), \qquad \text{for all } x,$$

that is, if for all x, the events {X = x} and A are independent. 

• X and Y are independent if for all possible pairs (x, y), the events 
{X = x} and {Y = y} are independent, or equivalently 

$$p_{X,Y}(x,y) = p_X(x)\, p_Y(y), \qquad \text{for all } x, y.$$

• If X and Y are independent random variables, then 

$$E[XY] = E[X]\, E[Y].$$

Furthermore, for any functions g and h, the random variables g(X) 
and h(Y) are independent, and we have 

$$E\big[g(X)\, h(Y)\big] = E\big[g(X)\big]\, E\big[h(Y)\big].$$

• If X and Y are independent, then 

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).$$
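The product rule for expectations and the additivity of variances can be checked on a small example. The Python sketch below builds a joint PMF as the product of two marginals, which is exactly the independence assumption, and evaluates both sides of each identity with exact fractions. The marginal PMFs are hypothetical values chosen only for illustration, and the helper names are our own.

```python
from fractions import Fraction

# Hypothetical marginal PMFs, chosen only for illustration.
p_x = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
p_y = {-1: Fraction(1, 3), 2: Fraction(2, 3)}

# Under independence the joint PMF is the product of the marginals.
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}


def expect(g):
    """E[g(X, Y)] computed from the joint PMF (expected value rule)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def variance(g):
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
product_rule_holds = expect(lambda x, y: x * y) == mean_x * mean_y
variance_rule_holds = variance(lambda x, y: x + y) == (
    variance(lambda x, y: x) + variance(lambda x, y: y)
)
print(product_rule_holds, variance_rule_holds)  # True True
```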
Independence of Several Random Variables 


All of the above have natural extensions to the case of more than two random 
variables. For example, three random variables X, Y, and Z are said to be 
independent if 

$$p_{X,Y,Z}(x,y,z) = p_X(x)\, p_Y(y)\, p_Z(z), \qquad \text{for all } x, y, z.$$

If X, Y, and Z are independent random variables, then any three random 
variables of the form f(X), g(Y), and h(Z), are also independent. Similarly, any 
two random variables of the form g(X,Y) and h(Z) are independent. On the 
other hand, two random variables of the form g(X,Y) and h(Y,Z) are usually 
not independent, because they are both affected by Y. Properties such as the 
above are intuitively clear if we interpret independence in terms of noninter- 
acting (sub)experiments. They can be formally verified (see the end-of-chapter 
problems), but this is sometimes tedious. Fortunately, there is general agree- 
ment between intuition and what is mathematically correct. This is basically a 
testament that the definitions of independence we have been using adequately 
reflect the intended interpretation. 

Another property that extends to multiple random variables is the 
following. If $X_1, X_2, \ldots, X_n$ are independent random variables, then 

$$\mathrm{var}(X_1 + X_2 + \cdots + X_n) = \mathrm{var}(X_1) + \mathrm{var}(X_2) + \cdots + \mathrm{var}(X_n).$$


This can be verified by a calculation similar to the one for the case of two random 
variables and is left as an exercise for the reader. 


Example 2.17. Variance of the Binomial. We consider n independent coin 
tosses, with each toss having probability p of coming up a head. For each i, we let 
$X_i$ be the Bernoulli random variable which is equal to 1 if the ith toss comes up 
a head, and is 0 otherwise. Then, $X = X_1 + X_2 + \cdots + X_n$ is a binomial random 
variable. By the independence of the coin tosses, the random variables $X_1, \ldots, X_n$ 
are independent, and 

$$\mathrm{var}(X) = \sum_{i=1}^{n} \mathrm{var}(X_i) = np(1-p).$$
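The binomial mean np and variance np(1 − p) can be confirmed by computing both moments directly from the binomial PMF. The Python sketch below uses exact fractions. The parameter values n = 12 and p = 1/3 are illustrative, and the function names are our own.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """p_X(k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, ..., n."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def mean_and_variance(pmf):
    mean = sum(k * prob for k, prob in pmf.items())
    var = sum((k - mean) ** 2 * prob for k, prob in pmf.items())
    return mean, var


n, p = 12, Fraction(1, 3)
mean, var = mean_and_variance(binomial_pmf(n, p))
print(mean, var)  # 4 8/3
```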


The formulas for the mean and variance of a weighted sum of random 
variables form the basis for many statistical procedures that estimate the mean 
of a random variable by averaging many independent samples. A typical case is 
illustrated in the following example. 


Example 2.18. Mean and Variance of the Sample Mean. We wish to 
estimate the approval rating of a president, to be called C. To this end, we ask n 
persons drawn at random from the voter population, and we let $X_i$ be a random 
variable that encodes the response of the ith person: 

$$X_i = \begin{cases} 1 & \text{if the $i$th person approves C’s performance,} \\ 0 & \text{if the $i$th person disapproves C’s performance.} \end{cases}$$

We model $X_1, X_2, \ldots, X_n$ as independent Bernoulli random variables with common 
mean p and variance p(1 − p). Naturally, we view p as the true approval rating of 
C. We “average” the responses and compute the sample mean $S_n$, defined as 

$$S_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$

Thus, $S_n$ is the approval rating of C within our n-person sample. 
We have, using the linearity of $S_n$ as a function of the $X_i$, 

$$E[S_n] = \sum_{i=1}^{n} \frac{1}{n} E[X_i] = \frac{1}{n} \sum_{i=1}^{n} p = p,$$

and making use of the independence of $X_1, \ldots, X_n$, 

$$\mathrm{var}(S_n) = \sum_{i=1}^{n} \frac{1}{n^2} \mathrm{var}(X_i) = \frac{p(1-p)}{n}.$$

The sample mean $S_n$ can be viewed as a “good” estimate of the approval rating. 
This is because it has the correct expected value, which is the approval rating p, and 
its accuracy, as reflected by its variance, improves as the sample size n increases. 
Note that even if the random variables $X_i$ are not Bernoulli, the same 
calculation yields 

$$\mathrm{var}(S_n) = \frac{\mathrm{var}(X)}{n},$$

as long as the $X_i$ are independent, with common mean E[X] and variance var(X). 
Thus, again, the sample mean becomes a very good estimate (in terms of variance) 
of the true mean E[X], as the sample size n increases. We will revisit the properties 
of the sample mean and discuss them in much greater detail in Chapter 7, when we 
discuss the laws of large numbers. 


Example 2.19. Estimating Probabilities by Simulation. In many practical 
situations, the analytical calculation of the probability of some event of interest is 
very difficult. However, if we have a physical or computer model that can generate 
outcomes of a given experiment in accordance with their true probabilities, we can 
use simulation to calculate with high accuracy the probability of any given event A. 
In particular, we independently generate with our model n outcomes, we record the 
number m that belong to the event A of interest, and we approximate P(A) by m/n. 
For example, to calculate the probability p = P(Heads) of a biased coin, we flip the 
coin n times, and we approximate p with the ratio (number of heads recorded) /n. 
To see how accurate this process is, consider n independent Bernoulli random 
variables $X_1, \ldots, X_n$, each with PMF 

$$p_{X_i}(k) = \begin{cases} P(A) & \text{if } k = 1, \\ 1 - P(A) & \text{if } k = 0. \end{cases}$$

In a simulation context, $X_i$ corresponds to the ith outcome, and takes the value 1 
if the ith outcome belongs to the event A. The value of the random variable 

$$X = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

is the estimate of P(A) provided by the simulation. According to Example 2.18, X 
has mean P(A) and variance P(A)(1 − P(A))/n, so that for large n, it provides an 
accurate estimate of P(A). 
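A simulation of this kind takes only a few lines of Python. The sketch below estimates P(Heads) for a biased coin by the ratio m/n, using the standard library pseudorandom generator as the "computer model". The function estimate_probability, its parameters, and the value 0.3 for the bias are our own illustrative assumptions.

```python
import random


def estimate_probability(in_event, sample_outcome, n, seed=0):
    """Estimate P(A) by m/n, where m counts simulated outcomes that fall in A."""
    rng = random.Random(seed)
    m = sum(1 for _ in range(n) if in_event(sample_outcome(rng)))
    return m / n


# Biased coin with P(Heads) = 0.3 (an illustrative value, not from the text).
P_HEADS = 0.3
estimate = estimate_probability(
    in_event=lambda outcome: outcome == "H",
    sample_outcome=lambda rng: "H" if rng.random() < P_HEADS else "T",
    n=100_000,
)
std_dev = (P_HEADS * (1 - P_HEADS) / 100_000) ** 0.5
print(estimate, std_dev)
```

With n = 100,000 the standard deviation of the estimate is about 0.0014, so the printed estimate should typically agree with 0.3 to two decimal places.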


2.8 SUMMARY AND DISCUSSION 


Random variables provide the natural tools for dealing with probabilistic mod- 
els in which the outcome determines certain numerical values of interest. In 
this chapter, we focused on discrete random variables, and developed the main 
concepts and some relevant tools. We also discussed several special random vari- 
ables, and derived their PMF, mean, and variance, as summarized in the table 
that follows. 


Summary of Results for Special Random Variables 


Discrete Uniform over [a, b]: 

$$p_X(k) = \begin{cases} \dfrac{1}{b - a + 1} & \text{if } k = a, a+1, \ldots, b, \\ 0 & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{a + b}{2}, \qquad \mathrm{var}(X) = \frac{(b - a)(b - a + 2)}{12}.$$

Bernoulli with Parameter p: (Describes the success or failure in a single 
trial.) 

$$p_X(k) = \begin{cases} p & \text{if } k = 1, \\ 1 - p & \text{if } k = 0, \end{cases}$$

$$E[X] = p, \qquad \mathrm{var}(X) = p(1 - p).$$

Binomial with Parameters p and n: (Describes the number of successes 
in n independent Bernoulli trials.) 

$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

$$E[X] = np, \qquad \mathrm{var}(X) = np(1 - p).$$

Geometric with Parameter p: (Describes the number of trials until the 
first success, in a sequence of independent Bernoulli trials.) 

$$p_X(k) = (1 - p)^{k-1} p, \qquad k = 1, 2, \ldots,$$

$$E[X] = \frac{1}{p}, \qquad \mathrm{var}(X) = \frac{1 - p}{p^2}.$$

Poisson with Parameter λ: (Approximates the binomial PMF when n 
is large, p is small, and λ = np.) 

$$p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, \ldots,$$

$$E[X] = \lambda, \qquad \mathrm{var}(X) = \lambda.$$


We also considered multiple random variables, and introduced their joint 
and conditional PMFs, and associated expected values. Conditional PMFs are 
often the starting point in probabilistic models and can be used to calculate 
other quantities of interest, such as marginal or joint PMFs and expectations, 
through a sequential or a divide-and-conquer approach. In particular, given the 
conditional PMF $p_{X|Y}(x \mid y)$: 

(a) The joint PMF can be calculated by 

$$p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y).$$

This can be extended to the case of three or more random variables, as in 

$$p_{X,Y,Z}(x,y,z) = p_Z(z)\, p_{Y|Z}(y \mid z)\, p_{X|Y,Z}(x \mid y, z),$$

and is analogous to the sequential tree-based calculation method using the 
multiplication rule, discussed in Chapter 1. 

(b) The marginal PMF can be calculated by 

$$p_X(x) = \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y),$$

which generalizes the divide-and-conquer calculation method we discussed 
in Chapter 1. 

(c) The divide-and-conquer calculation method in (b) above can be extended 
to compute expected values using the total expectation theorem: 

$$E[X] = \sum_{y} p_Y(y)\, E[X \mid Y = y].$$

The concepts and methods of this chapter extend appropriately to general 
random variables (see the next chapter), and are fundamental for our subject. 


General Random Variables 
3.1. Continuous Random Variables and PDFs 
3.2. Cumulative Distribution Functions 
Random variables with a continuous range of possible experimental values are 
quite common — the velocity of a vehicle traveling along the highway could be one 
example. If such a velocity is measured by a digital speedometer, the speedome- 
ter’s reading is a discrete random variable. But if we also wish to model the exact 
velocity, a continuous random variable is called for. Models involving continuous 
random variables can be useful for several reasons. Besides being finer-grained 
and possibly more accurate, they allow the use of powerful tools from calculus 
and often admit an insightful analysis that would not be possible under a discrete 
model. 

All of the concepts and methods introduced in Chapter 2, such as expec- 
tation, PMFs, and conditioning, have continuous counterparts. Developing and 
interpreting these counterparts is the subject of this chapter. 


3.1 CONTINUOUS RANDOM VARIABLES AND PDFS 


A random variable X is called continuous if its probability law can be described 
in terms of a nonnegative function $f_X$, called the probability density function 
of X, or PDF for short, which satisfies 

$$P(X \in B) = \int_{B} f_X(x)\, dx,$$

for every subset B of the real line.† In particular, the probability that the value 
of X falls within an interval is 

$$P(a \le X \le b) = \int_{a}^{b} f_X(x)\, dx,$$

and can be interpreted as the area under the graph of the PDF (see Fig. 3.1). 
For any single value a, we have $P(X = a) = \int_{a}^{a} f_X(x)\, dx = 0$. For this reason, 
including or excluding the endpoints of an interval has no effect on its probability: 

$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b).$$

Note that to qualify as a PDF, a function $f_X$ must be nonnegative, i.e., 
$f_X(x) \ge 0$ for every x, and must also satisfy the normalization equation 

$$\int_{-\infty}^{\infty} f_X(x)\, dx = P(-\infty < X < \infty) = 1.$$

Graphically, this means that the entire area under the graph of the PDF must 
be equal to 1. 

† The integral $\int_{B} f_X(x)\, dx$ is to be interpreted in the usual calculus/Riemann 
sense and we implicitly assume that it is well-defined. For highly unusual functions 
and sets, this integral can be harder, or even impossible, to define, but such issues 
belong to a more advanced treatment of the subject. In any case, it is comforting 
to know that mathematical subtleties of this type do not arise if $f_X$ is a piecewise 
continuous function with a finite number of points of discontinuity, and B is the union 
of a finite or countable number of intervals. 

Figure 3.1: Illustration of a PDF. The probability that X takes value in an 
interval [a, b] is $\int_{a}^{b} f_X(x)\, dx$, which is the shaded area in the figure. 

To interpret the PDF, note that for an interval [x, x + δ] with very small 
length δ, we have 

$$P\big([x, x + \delta]\big) = \int_{x}^{x+\delta} f_X(t)\, dt \approx f_X(x) \cdot \delta,$$

so we can view $f_X(x)$ as the “probability mass per unit length” near x (cf. 
Fig. 3.2). It is important to realize that even though a PDF is used to calculate 
event probabilities, $f_X(x)$ is not the probability of any particular event. In 
particular, it is not restricted to be less than or equal to one. 


Figure 3.2: Interpretation of the PDF $f_X(x)$ as “probability mass per unit length” 
around x. If δ is very small, the probability that X takes value in the 
interval [x, x + δ] is the shaded area in the figure, which is approximately equal to 
$f_X(x) \cdot \delta$. 


Example 3.1. Continuous Uniform Random Variable. A gambler spins 
a wheel of fortune, continuously calibrated between 0 and 1, and observes the 
resulting number. Assuming that all subintervals of [0,1] of the same length are 
equally likely, this experiment can be modeled in terms of a random variable X with 
PDF 

$$f_X(x) = \begin{cases} c & \text{if } 0 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases}$$

for some constant c. This constant can be determined by using the normalization 
property 

$$1 = \int_{-\infty}^{\infty} f_X(x)\, dx = \int_{0}^{1} c\, dx = c \int_{0}^{1} dx = c,$$

so that c = 1. 


More generally, we can consider a random variable $X$ that takes values in
an interval $[a,b]$, and again assume that all subintervals of the same length are
equally likely. We refer to this type of random variable as uniform or uniformly
distributed. Its PDF has the form

$$f_X(x) = \begin{cases} c & \text{if } a \le x \le b, \\ 0 & \text{otherwise,} \end{cases}$$

where $c$ is a constant. This is the continuous analog of the discrete uniform random
variable discussed in Chapter 2. For $f_X$ to satisfy the normalization property, we
must have (cf. Fig. 3.3)

$$1 = \int_a^b c\,dx = c\int_a^b dx = c(b-a),$$

so that

$$c = \frac{1}{b-a}.$$

Figure 3.3: The PDF of a uniform random variable.


Note that the probability $P(X \in I)$ that $X$ takes a value in a set $I$ is

$$P(X \in I) = \int_{[a,b]\cap I} \frac{1}{b-a}\,dx = \frac{\text{length of } [a,b]\cap I}{\text{length of } [a,b]}.$$


The uniform random variable bears a relation to the discrete uniform law, which 
involves a sample space with a finite number of equally likely outcomes. The dif- 
ference is that to obtain the probability of various events, we must now calculate 
the “length” of various subsets of the real line instead of counting the number of 
outcomes contained in various events. 


Example 3.2. Piecewise Constant PDF. Alvin’s driving time to work is
between 15 and 20 minutes if the day is sunny, and between 20 and 25 minutes if
the day is rainy, with all times being equally likely in each case. Assume that a day
is sunny with probability 2/3 and rainy with probability 1/3. What is the PDF of
the driving time, viewed as a random variable $X$?

We interpret the statement that “all times are equally likely” in the sunny
and the rainy cases, to mean that the PDF of $X$ is constant in each of the intervals
$[15,20]$ and $[20,25]$. Furthermore, since these two intervals contain all possible
driving times, the PDF should be zero everywhere else:

$$f_X(x) = \begin{cases} c_1 & \text{if } 15 \le x < 20, \\ c_2 & \text{if } 20 \le x \le 25, \\ 0 & \text{otherwise,} \end{cases}$$

where $c_1$ and $c_2$ are some constants. We can determine these constants by using
the given probabilities of a sunny and of a rainy day:

$$\frac{2}{3} = P(\text{sunny day}) = \int_{15}^{20} f_X(x)\,dx = \int_{15}^{20} c_1\,dx = 5c_1,$$

$$\frac{1}{3} = P(\text{rainy day}) = \int_{20}^{25} f_X(x)\,dx = \int_{20}^{25} c_2\,dx = 5c_2,$$

so that

$$c_1 = \frac{2}{15}, \qquad c_2 = \frac{1}{15}.$$
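The constants of this example can be double-checked with exact arithmetic. The following is a minimal sketch, not part of the original notes; the helper names `piecewise_constants` and `total_area` are hypothetical. Each constant is the probability of an interval divided by the length of that interval, and the total area under the resulting PDF must be 1.

```python
from fractions import Fraction

def piecewise_constants(breakpoints, interval_probs):
    """Return the constant PDF value on each interval between breakpoints.

    breakpoints: increasing scalars delimiting the intervals
    interval_probs: probability assigned to each interval
    """
    if len(interval_probs) != len(breakpoints) - 1:
        raise ValueError("need one probability per interval")
    if sum(interval_probs) != 1:
        raise ValueError("interval probabilities must sum to 1")
    return [p / (hi - lo)
            for p, lo, hi in zip(interval_probs, breakpoints, breakpoints[1:])]

def total_area(breakpoints, constants):
    """Area under the piecewise constant PDF; 1 for a valid PDF."""
    return sum(c * (hi - lo)
               for c, lo, hi in zip(constants, breakpoints, breakpoints[1:]))

# Example 3.2: sunny w.p. 2/3 on [15, 20], rainy w.p. 1/3 on [20, 25].
breakpoints = [15, 20, 25]
c = piecewise_constants(breakpoints, [Fraction(2, 3), Fraction(1, 3)])
print(c)                           # [Fraction(2, 15), Fraction(1, 15)]
print(total_area(breakpoints, c))  # 1
```

Using `Fraction` keeps the results exact, so they can be compared directly with the values 2/15 and 1/15 obtained above.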


Generalizing this example, consider a random variable $X$ whose PDF has the
piecewise constant form

$$f_X(x) = \begin{cases} c_i & \text{if } a_i \le x < a_{i+1}, \quad i = 1, 2, \ldots, n-1, \\ 0 & \text{otherwise,} \end{cases}$$

where $a_1, a_2, \ldots, a_n$ are some scalars with $a_i < a_{i+1}$ for all $i$, and
$c_1, c_2, \ldots, c_{n-1}$ are some nonnegative constants (cf. Fig. 3.4). The constants
$c_i$ may be determined by additional problem data, as in the case of the preceding
driving context. Generally, the $c_i$ must be such that the normalization property holds:

$$1 = \int_{a_1}^{a_n} f_X(x)\,dx = \sum_{i=1}^{n-1} \int_{a_i}^{a_{i+1}} c_i\,dx = \sum_{i=1}^{n-1} c_i (a_{i+1} - a_i).$$

Figure 3.4: A piecewise constant PDF involving three intervals.

Example 3.3. A PDF can be arbitrarily large. Consider a random variable
$X$ with PDF

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{x}} & \text{if } 0 < x \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Even though $f_X(x)$ becomes infinitely large as $x$ approaches zero, this is still a valid
PDF, because

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^1 \frac{1}{2\sqrt{x}}\,dx = \sqrt{x}\,\Big|_0^1 = 1.$$


Summary of PDF Properties 

Let $X$ be a continuous random variable with PDF $f_X$.

• $f_X(x) \ge 0$ for all $x$.

• $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.

• If $\delta$ is very small, then $P([x, x+\delta]) \approx f_X(x)\cdot\delta$.

• For any subset $B$ of the real line,

$$P(X \in B) = \int_B f_X(x)\,dx.$$


Expectation 


The expected value or mean of a continuous random variable $X$ is defined
by†

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

† One has to deal with the possibility that the integral $\int_{-\infty}^{\infty} x f_X(x)\,dx$ is
infinite or undefined. More concretely, we will say that the expectation is well-defined if
$\int_{-\infty}^{\infty} |x| f_X(x)\,dx < \infty$. In that case, it is known that the integral
$\int_{-\infty}^{\infty} x f_X(x)\,dx$ takes a finite and unambiguous value.

For an example where the expectation is not well-defined, consider a random
variable $X$ with PDF $f_X(x) = c/(1+x^2)$, where $c$ is a constant chosen to enforce the
normalization condition. The expression $|x| f_X(x)$ is approximately proportional to $1/|x|$
when $|x|$ is large. Using the fact $\int_1^{\infty} (1/x)\,dx = \infty$, one can show that
$\int_{-\infty}^{\infty} |x| f_X(x)\,dx = \infty$. Thus, $E[X]$ is left undefined, despite the
symmetry of the PDF around zero.

Throughout this book, absent an indication to the contrary, we implicitly
assume that the expected value of the random variables of interest is well-defined.

This is similar to the discrete case except that the PMF is replaced by the
PDF, and summation is replaced by integration. As in Chapter 2, E[X] can be 
interpreted as the “center of gravity” of the probability law and, also, as the 
anticipated average value of X in a large number of independent repetitions of 
the experiment. Its mathematical properties are similar to the discrete case — 
after all, an integral is just a limiting form of a sum. 

If X is a continuous random variable with given PDF, any real-valued 
function Y = g(X) of X is also a random variable. Note that Y can be a 
continuous random variable: for example, consider the trivial case where Y = 
g(X) = X. But Y can also turn out to be discrete. For example, suppose that 
g(x) = 1 for x > 0, and g(x) = 0, otherwise. Then Y = g(X) is a discrete 
random variable. In either case, the mean of g(X) satisfies the expected value 
rule 


$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx,$$

in complete analogy with the discrete case.

The $n$th moment of a continuous random variable $X$ is defined as $E[X^n]$,
the expected value of the random variable $X^n$. The variance, denoted by
$\mathrm{var}(X)$, is defined as the expected value of the random variable $(X - E[X])^2$.

We now summarize this discussion and list a number of additional facts 
that are practically identical to their discrete counterparts. 


Expectation of a Continuous Random Variable and its Properties

Let $X$ be a continuous random variable with PDF $f_X$.

• The expectation of $X$ is defined by

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

• The expected value rule for a function $g(X)$ has the form

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

• The variance of $X$ is defined by

$$\mathrm{var}(X) = E\big[(X - E[X])^2\big] = \int_{-\infty}^{\infty} (x - E[X])^2 f_X(x)\,dx.$$

• We have

$$0 \le \mathrm{var}(X) = E[X^2] - (E[X])^2.$$

• If $Y = aX + b$, where $a$ and $b$ are given scalars, then

$$E[Y] = aE[X] + b, \qquad \mathrm{var}(Y) = a^2\,\mathrm{var}(X).$$


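Example 3.4. Mean and Variance of the Uniform Random Variable.
Consider the case of a uniform PDF over an interval $[a,b]$, as in Example 3.1. We
have

$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\,dx \\
&= \int_a^b x \cdot \frac{1}{b-a}\,dx \\
&= \frac{1}{b-a} \cdot \frac{1}{2}x^2\,\Big|_a^b \\
&= \frac{1}{b-a} \cdot \frac{b^2 - a^2}{2} \\
&= \frac{a+b}{2},
\end{aligned}$$

as one expects based on the symmetry of the PDF around $(a+b)/2$.
To obtain the variance, we first calculate the second moment. We have

$$\begin{aligned}
E[X^2] &= \int_a^b \frac{x^2}{b-a}\,dx \\
&= \frac{1}{b-a} \int_a^b x^2\,dx \\
&= \frac{1}{b-a} \cdot \frac{1}{3}x^3\,\Big|_a^b \\
&= \frac{b^3 - a^3}{3(b-a)} \\
&= \frac{a^2 + ab + b^2}{3}.
\end{aligned}$$

Thus, the variance is obtained as

$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12},$$

after some calculation.

Suppose now that $[a,b] = [0,1]$, and consider the function $g(x) = 1$ if $x \le 1/3$,
and $g(x) = 2$ if $x > 1/3$. The random variable $Y = g(X)$ is a discrete one with
PMF $p_Y(1) = P(X \le 1/3) = 1/3$, $p_Y(2) = 1 - p_Y(1) = 2/3$. Thus,

$$E[Y] = 1 \cdot p_Y(1) + 2 \cdot p_Y(2) = \frac{1}{3} + \frac{4}{3} = \frac{5}{3}.$$

The same result could be obtained using the expected value rule:

$$E[Y] = \int_0^1 g(x) f_X(x)\,dx = \int_0^{1/3} dx + \int_{1/3}^1 2\,dx = \frac{5}{3}.$$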
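The closed-form mean $(a+b)/2$ and variance $(b-a)^2/12$ of the uniform random variable can be sanity-checked by numerical integration. The sketch below is an illustration only: the function name `uniform_moments`, the midpoint rule, and the endpoints 2 and 5 are assumptions made for the check, not part of the notes.

```python
def uniform_moments(a, b, steps=100_000):
    """Approximate E[X] and var(X) for X uniform on [a, b] (midpoint rule)."""
    density = 1.0 / (b - a)          # the constant PDF value on [a, b]
    h = (b - a) / steps              # width of each small subinterval
    mean = 0.0
    second = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h        # midpoint of the i-th subinterval
        mean += x * density * h      # contribution to the integral of x f(x)
        second += x * x * density * h  # contribution to the integral of x^2 f(x)
    return mean, second - mean ** 2

a, b = 2.0, 5.0                      # illustrative endpoints
mean, var = uniform_moments(a, b)
print(mean, (a + b) / 2)             # numerical vs. closed-form mean
print(var, (b - a) ** 2 / 12)        # numerical vs. closed-form variance
```

The two numbers on each printed line should agree to many decimal places, since the midpoint rule is very accurate for polynomial integrands.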


Exponential Random Variable 


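An exponential random variable has a PDF of the form

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{otherwise,} \end{cases}$$

where $\lambda$ is a positive parameter characterizing the PDF (see Fig. 3.5). This is a
legitimate PDF because

$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\,\Big|_0^{\infty} = 1.$$

Note that the probability that $X$ exceeds a certain value falls exponentially.
Indeed, for any $a \ge 0$, we have

$$P(X \ge a) = \int_a^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\,\Big|_a^{\infty} = e^{-\lambda a}.$$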


An exponential random variable can be a very good model for the amount 
of time until a piece of equipment breaks down, until a light bulb burns out, 
or until an accident occurs. It will play a major role in our study of random 
processes in Chapter 5, but for the time being we will simply view it as an 
example of a random variable that is fairly tractable analytically. 


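Figure 3.5: The PDF $\lambda e^{-\lambda x}$ of an exponential random variable, for a small
and for a large value of $\lambda$.

The mean and the variance can be calculated to be

$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$

These formulas can be verified by straightforward calculation, as we now show.
We have, using integration by parts,

$$\begin{aligned}
E[X] &= \int_0^{\infty} x \lambda e^{-\lambda x}\,dx \\
&= \left(-x e^{-\lambda x}\right)\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx \\
&= 0 - \frac{e^{-\lambda x}}{\lambda}\,\Big|_0^{\infty} \\
&= \frac{1}{\lambda}.
\end{aligned}$$

Using again integration by parts, the second moment is

$$\begin{aligned}
E[X^2] &= \int_0^{\infty} x^2 \lambda e^{-\lambda x}\,dx \\
&= \left(-x^2 e^{-\lambda x}\right)\Big|_0^{\infty} + \int_0^{\infty} 2x e^{-\lambda x}\,dx \\
&= 0 + \frac{2}{\lambda} E[X] \\
&= \frac{2}{\lambda^2}.
\end{aligned}$$

Finally, using the formula $\mathrm{var}(X) = E[X^2] - (E[X])^2$, we obtain

$$\mathrm{var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$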


Example 3.5. The time until a small meteorite first lands anywhere in the Sahara 
desert is modeled as an exponential random variable with a mean of 10 days. The 
time is currently midnight. What is the probability that a meteorite first lands 
some time between 6am and 6pm of the first day? 

Let $X$ be the time elapsed until the event of interest, measured in days.
Then, $X$ is exponential, with mean $1/\lambda = 10$, which yields $\lambda = 1/10$. The desired
probability is

$$P(1/4 \le X \le 3/4) = P(X \ge 1/4) - P(X > 3/4) = e^{-1/40} - e^{-3/40} = 0.0476,$$

where we have used the formula $P(X \ge a) = P(X > a) = e^{-\lambda a}$.

Let us also derive an expression for the probability that the time when a
meteorite first lands will be between 6am and 6pm of some day. For the $k$th day,
this set of times corresponds to the event $k - (3/4) \le X \le k - (1/4)$. Since these
events are disjoint, the probability of interest is

$$\sum_{k=1}^{\infty} P\left(k - \frac{3}{4} \le X \le k - \frac{1}{4}\right) = \sum_{k=1}^{\infty} \left( P\left(X \ge k - \frac{3}{4}\right) - P\left(X > k - \frac{1}{4}\right) \right) = \sum_{k=1}^{\infty} \left( e^{-(4k-3)/40} - e^{-(4k-1)/40} \right).$$

We omit the remainder of the calculation, which involves using the geometric series
formula.
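The omitted calculation can be checked numerically. The sketch below is an illustration, not the authors' derivation: it compares a long partial sum of the series with the value that the geometric series formula gives, $(e^{-1/40} - e^{-3/40})/(1 - e^{-1/10})$. The helper name `tail` and the cutoff of 2000 terms are assumptions.

```python
import math

lam = 1 / 10  # rate per day, since the mean is 1/lam = 10 days

def tail(a):
    """P(X >= a) = exp(-lam * a) for the exponential random variable."""
    return math.exp(-lam * a)

# First day only (Example 3.5): P(1/4 <= X <= 3/4).
first_day = tail(0.25) - tail(0.75)

# Any day: partial sum of the series over k = 1, 2, ..., 2000.
partial = sum(tail(k - 0.75) - tail(k - 0.25) for k in range(1, 2001))

# Each term is exp(-lam) times the previous one, so the geometric series
# formula gives first_day / (1 - exp(-lam)).
closed_form = first_day / (1 - math.exp(-lam))

print(round(first_day, 4))    # 0.0476
print(round(closed_form, 4))  # 0.4998
```

The result is slightly below 1/2: the daytime window covers half of every day, but the exponential PDF is decreasing, so the first quarter of each day carries a little more probability than the last quarter.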


3.2 CUMULATIVE DISTRIBUTION FUNCTIONS


We have been dealing with discrete and continuous random variables in a somewhat
different manner, using PMFs and PDFs, respectively. It would be desirable
to describe all kinds of random variables with a single mathematical concept.
This is accomplished by the cumulative distribution function, or CDF for
short. The CDF of a random variable $X$ is denoted by $F_X$ and provides the
probability $P(X \le x)$. In particular, for every $x$ we have

$$F_X(x) = P(X \le x) = \begin{cases} \displaystyle\sum_{k \le x} p_X(k) & X\text{: discrete}, \\[2ex] \displaystyle\int_{-\infty}^{x} f_X(t)\,dt & X\text{: continuous}. \end{cases}$$

Loosely speaking, the CDF $F_X(x)$ “accumulates” probability “up to” the value $x$.

Any random variable associated with a given probability model has a CDF,
regardless of whether it is discrete, continuous, or other. This is because $\{X \le x\}$
is always an event and therefore has a well-defined probability. Figures 3.6 and
3.7 illustrate the CDFs of various discrete and continuous random variables.
From these figures, as well as from the definition, some general properties of the
CDF can be observed.

Figure 3.6: CDFs of some discrete random variables. The CDF is related to the
PMF through the formula

$$F_X(x) = P(X \le x) = \sum_{k \le x} p_X(k),$$

and has a staircase form, with jumps occurring at the values of positive probability
mass. Note that at the points where a jump occurs, the value of $F_X$ is the larger
of the two corresponding values (i.e., $F_X$ is continuous from the right).


Properties of a CDF

The CDF $F_X$ of a random variable $X$ is defined by

$$F_X(x) = P(X \le x), \qquad \text{for all } x,$$

and has the following properties.

• $F_X$ is monotonically nondecreasing:

$$\text{if } x \le y, \text{ then } F_X(x) \le F_X(y).$$

• $F_X(x)$ tends to 0 as $x \to -\infty$, and to 1 as $x \to \infty$.

• If $X$ is discrete, then $F_X$ has a piecewise constant and staircase-like
form.

• If $X$ is continuous, then $F_X$ has a continuously varying form.

• If $X$ is discrete and takes integer values, the PMF and the CDF can
be obtained from each other by summing or differencing:

$$F_X(k) = \sum_{i=-\infty}^{k} p_X(i),$$

$$p_X(k) = P(X \le k) - P(X \le k-1) = F_X(k) - F_X(k-1),$$

for all integers $k$.

• If $X$ is continuous, the PDF and the CDF can be obtained from each
other by integration or differentiation:

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt, \qquad f_X(x) = \frac{dF_X}{dx}(x).$$

(The latter relation is valid for those $x$ for which the CDF has a derivative.)


Because the CDF is defined for any type of random variable, it provides 
a convenient means for exploring the relations between continuous and discrete 
random variables. This is illustrated in the following example, which shows 
that there is a close relation between the geometric and the exponential random 
variables. 


Example 3.6. The Geometric and Exponential CDFs. Let $X$ be a geometric
random variable with parameter $p$; that is, $X$ is the number of trials to obtain the
first success in a sequence of independent Bernoulli trials, where the probability of
success is $p$. Thus, for $k = 1, 2, \ldots$, we have $P(X = k) = p(1-p)^{k-1}$ and the CDF
is given by

$$F^{\mathrm{geo}}(n) = \sum_{k=1}^{n} p(1-p)^{k-1} = p\,\frac{1 - (1-p)^n}{1 - (1-p)} = 1 - (1-p)^n, \qquad \text{for } n = 1, 2, \ldots$$

Suppose now that $X$ is an exponential random variable with parameter $\lambda > 0$.
Its CDF is given by

$$F^{\mathrm{exp}}(x) = P(X \le x) = 0, \qquad \text{for } x \le 0,$$

$$F^{\mathrm{exp}}(x) = \int_0^x \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\,\Big|_0^x = 1 - e^{-\lambda x}, \qquad \text{for } x > 0.$$

Figure 3.7: CDFs of some continuous random variables. The CDF is related to
the PDF through the formula

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt.$$

Thus, the PDF $f_X$ can be obtained from the CDF by differentiation:

$$f_X(x) = \frac{dF_X(x)}{dx}.$$

For a continuous random variable, the CDF has no jumps, i.e., it is continuous.


To compare the two CDFs above, let $\delta = -\ln(1-p)/\lambda$, so that

$$e^{-\lambda\delta} = 1 - p.$$

Then we see that the values of the exponential and the geometric CDF are equal
for all $x = n\delta$, where $n = 1, 2, \ldots$, i.e.,

$$F^{\mathrm{exp}}(n\delta) = F^{\mathrm{geo}}(n), \qquad n = 1, 2, \ldots,$$

as illustrated in Fig. 3.8.

If $\delta$ is very small, there is close proximity of the exponential and the geometric
CDFs, provided that we scale the values taken by the geometric random variable by
$\delta$. This relation is best interpreted by viewing $X$ as time, either continuous, in the
case of the exponential, or $\delta$-discretized, in the case of the geometric. In particular,
suppose that $\delta$ is a small number, and that every $\delta$ seconds, we flip a coin with the
probability of heads being a small number $p$. Then, the time of the first occurrence
of heads is well approximated by an exponential random variable. The parameter
$\lambda$ of this exponential is such that $e^{-\lambda\delta} = 1 - p$ or $\lambda = -\ln(1-p)/\delta$. This relation
between the geometric and the exponential random variables will play an important
role in the theory of the Bernoulli and Poisson stochastic processes in Chapter 5.

Figure 3.8: Relation of the geometric CDF $1 - (1-p)^n$, with $p = 1 - e^{-\lambda\delta}$, and
the exponential CDF $1 - e^{-\lambda x}$. We have

$$F^{\mathrm{exp}}(n\delta) = F^{\mathrm{geo}}(n), \qquad n = 1, 2, \ldots,$$

if the interval $\delta$ is such that $e^{-\lambda\delta} = 1 - p$. As $\delta$ approaches 0, the exponential
random variable can be interpreted as the “limit” of the geometric.
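The agreement of the two CDFs at the points $x = n\delta$ can be observed numerically. The following is a small sketch; the parameter values for `lam` and `p` are illustrative assumptions, not values from the text.

```python
import math

def geometric_cdf(n, p):
    """F_geo(n) = 1 - (1 - p)**n for n = 1, 2, ..."""
    return 1 - (1 - p) ** n

def exponential_cdf(x, lam):
    """F_exp(x) = 1 - exp(-lam * x) for x > 0, and 0 otherwise."""
    return 1 - math.exp(-lam * x) if x > 0 else 0.0

lam = 2.0   # illustrative rate of the exponential
p = 0.01    # illustrative success probability of the geometric
delta = -math.log(1 - p) / lam   # chosen so that exp(-lam * delta) = 1 - p

for n in (1, 10, 100):
    # The two printed CDF values agree up to floating-point rounding.
    print(n, geometric_cdf(n, p), exponential_cdf(n * delta, lam))
```

With these values $\delta$ is about 0.005, so the geometric random variable, scaled by $\delta$, tracks the exponential on a fine time grid.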


Sometimes, in order to calculate the PMF or PDF of a discrete or contin- 
uous random variable, respectively, it is more convenient to first calculate the 
CDF and then use the preceding relations. The systematic use of this approach 
for the case of a continuous random variable will be discussed in Section 3.6. 
The following is a discrete example. 


Example 3.7. The Maximum of Several Random Variables. You are 
allowed to take a certain test three times, and your final score will be the maximum 
of the test scores. Thus, 


X = max{X1, X2, X3}, 


where $X_1, X_2, X_3$ are the three test scores and $X$ is the final score. Assume that
your score in each test takes one of the values from 1 to 10 with equal probability
1/10, independently of the scores in other tests. What is the PMF $p_X$ of the final
score?

We calculate the PMF indirectly. We first compute the CDF $F_X(k)$ and then
obtain the PMF as

$$p_X(k) = F_X(k) - F_X(k-1), \qquad k = 1, \ldots, 10.$$

We have

$$\begin{aligned}
F_X(k) &= P(X \le k) \\
&= P(X_1 \le k,\ X_2 \le k,\ X_3 \le k) \\
&= P(X_1 \le k)\,P(X_2 \le k)\,P(X_3 \le k) \\
&= \left(\frac{k}{10}\right)^3,
\end{aligned}$$

where the third equality follows from the independence of the events $\{X_1 \le k\}$,
$\{X_2 \le k\}$, $\{X_3 \le k\}$. Thus the PMF is given by

$$p_X(k) = \left(\frac{k}{10}\right)^3 - \left(\frac{k-1}{10}\right)^3, \qquad k = 1, \ldots, 10.$$
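The CDF-differencing formula can be compared against a brute-force enumeration of all $10^3$ equally likely score triples. This is a minimal sketch for checking the result; the function name `pmf_of_max` and its parameters are hypothetical.

```python
from fractions import Fraction
from itertools import product

def pmf_of_max(k, sides=10, tests=3):
    """p_X(k) = F_X(k) - F_X(k-1), with F_X(k) = (k/sides)**tests."""
    cdf = lambda j: Fraction(j, sides) ** tests
    return cdf(k) - cdf(k - 1)

pmf = {k: pmf_of_max(k) for k in range(1, 11)}

# Brute-force check: count the triples whose maximum equals each value.
counts = {k: 0 for k in range(1, 11)}
for scores in product(range(1, 11), repeat=3):
    counts[max(scores)] += 1
brute = {k: Fraction(n, 1000) for k, n in counts.items()}

print(pmf[10])       # 271/1000
print(pmf == brute)  # True
```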


3.3 NORMAL RANDOM VARIABLES


A continuous random variable $X$ is said to be normal or Gaussian if it has a
PDF of the form (see Fig. 3.9)

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2},$$

where $\mu$ and $\sigma$ are two scalar parameters characterizing the PDF, with $\sigma$ assumed
positive. It can be verified that the normalization property

$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\,dx = 1$$

holds (see the theoretical problems).


Figure 3.9: A normal PDF and CDF, with $\mu = 1$ and $\sigma^2 = 1$. We observe that
the PDF is symmetric around its mean $\mu$, and has a characteristic bell-shape.
As $x$ gets further from $\mu$, the term $e^{-(x-\mu)^2/2\sigma^2}$ decreases very rapidly. In this
figure, the PDF is very close to zero outside the interval $[-1, 3]$.

The mean and the variance can be calculated to be

$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2.$$

To see this, note that the PDF is symmetric around $\mu$, so its mean must be $\mu$.
Furthermore, the variance is given by

$$\mathrm{var}(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (x-\mu)^2 e^{-(x-\mu)^2/2\sigma^2}\,dx.$$

Using the change of variables $y = (x-\mu)/\sigma$ and integration by parts, we have

$$\begin{aligned}
\mathrm{var}(X) &= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2/2}\,dy \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \left(-y e^{-y^2/2}\right)\Big|_{-\infty}^{\infty} + \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \\
&= \sigma^2.
\end{aligned}$$

The last equality above is obtained by using the fact

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy = 1,$$

which is just the normalization property of the normal PDF for the case where
$\mu = 0$ and $\sigma = 1$.

The normal random variable has several special properties. The following 
one is particularly important and will be justified in Section 3.6. 


Normality is Preserved by Linear Transformations

If $X$ is a normal random variable with mean $\mu$ and variance $\sigma^2$, and if $a$, $b$
are scalars, then the random variable

$$Y = aX + b$$

is also normal, with mean and variance

$$E[Y] = a\mu + b, \qquad \mathrm{var}(Y) = a^2\sigma^2.$$

The Standard Normal Random Variable

A normal random variable $Y$ with zero mean and unit variance is said to be a
standard normal. Its CDF is denoted by $\Phi$,

$$\Phi(y) = P(Y \le y) = P(Y < y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2}\,dt.$$


It is recorded in a table (given on the next page), and is a very useful tool
for calculating various probabilities involving normal random variables; see also
Fig. 3.10.

Note that the table only provides the values of $\Phi(y)$ for $y \ge 0$, because the
omitted values can be found using the symmetry of the PDF. For example, if $Y$
is a standard normal random variable, we have

$$\begin{aligned}
\Phi(-0.5) &= P(Y \le -0.5) = P(Y \ge 0.5) = 1 - P(Y < 0.5) \\
&= 1 - \Phi(0.5) = 1 - 0.6915 = 0.3085.
\end{aligned}$$

Let $X$ be a normal random variable with mean $\mu$ and variance $\sigma^2$. We
“standardize” $X$ by defining a new random variable $Y$ given by

$$Y = \frac{X - \mu}{\sigma}.$$

Since $Y$ is a linear transformation of $X$, it is normal. Furthermore,

$$E[Y] = \frac{E[X] - \mu}{\sigma} = 0, \qquad \mathrm{var}(Y) = \frac{\mathrm{var}(X)}{\sigma^2} = 1.$$


Thus, Y is a standard normal random variable. This fact allows us to calculate 
the probability of any event defined in terms of X: we redefine the event in terms 
of Y, and then use the standard normal table. 


Figure 3.10: The PDF

$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}$$

of the standard normal random variable. Its corresponding CDF, which is denoted
by $\Phi(y)$, is recorded in a table.

Example 3.8. Using the Normal Table. The annual snowfall at a particular
geographic location is modeled as a normal random variable with a mean of $\mu = 60$
inches, and a standard deviation of $\sigma = 20$. What is the probability that this year’s
snowfall will be at least 80 inches?

Let $X$ be the snow accumulation, viewed as a normal random variable, and
let

$$Y = \frac{X - \mu}{\sigma} = \frac{X - 60}{20},$$

be the corresponding standard normal random variable. We want to find

$$P(X \ge 80) = P\left(\frac{X - 60}{20} \ge \frac{80 - 60}{20}\right) = P\left(Y \ge \frac{80 - 60}{20}\right) = P(Y \ge 1) = 1 - \Phi(1),$$

where $\Phi$ is the CDF of the standard normal. We read the value $\Phi(1)$ from the table:

$$\Phi(1) = 0.8413,$$

so that

$$P(X \ge 80) = 1 - \Phi(1) = 0.1587.$$

Generalizing the approach in the preceding example, we have the following 
procedure. 


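CDF Calculation of the Normal Random Variable

The CDF of a normal random variable $X$ with mean $\mu$ and variance $\sigma^2$ is
obtained using the standard normal table as

$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Y \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right),$$

where $Y$ is a standard normal random variable.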
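In place of a printed table, $\Phi$ can be evaluated with the error function, using the identity $\Phi(y) = \tfrac{1}{2}\big(1 + \mathrm{erf}(y/\sqrt{2})\big)$. The sketch below applies the standardization procedure to the numbers of Example 3.8; the function names are hypothetical and the use of `math.erf` is a substitute for the table lookup described in the text.

```python
import math

def standard_normal_cdf(y):
    """Phi(y), written in terms of the error function erf."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X normal with mean mu and standard deviation sigma > 0."""
    return standard_normal_cdf((x - mu) / sigma)

# Example 3.8: snowfall with mu = 60, sigma = 20; P(X >= 80) = 1 - Phi(1).
print(round(standard_normal_cdf(1.0), 4))    # 0.8413
print(round(1 - normal_cdf(80, 60, 20), 4))  # 0.1587

# Symmetry example from the text: Phi(-0.5) = 1 - Phi(0.5).
print(round(standard_normal_cdf(-0.5), 4))   # 0.3085
```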


The normal random variable is often used in signal processing and com- 
munications engineering to model noise and unpredictable distortions of signals. 
The following is a typical example. 


Example 3.9. Signal Detection. A binary message is transmitted as a signal
that is either $-1$ or $+1$. The communication channel corrupts the transmission with
additive normal noise with mean $\mu = 0$ and variance $\sigma^2$. The receiver concludes
that the signal $-1$ (or $+1$) was transmitted if the value received is $< 0$ (or $\ge 0$,
respectively); see Fig. 3.11. What is the probability of error?

An error occurs whenever $-1$ is transmitted and the noise $N$ is at least 1 so
that $N + S = N - 1 \ge 0$, or whenever $+1$ is transmitted and the noise $N$ is smaller
than $-1$ so that $N + S = N + 1 < 0$. In the former case, the probability of error is

$$\begin{aligned}
P(N \ge 1) &= 1 - P(N < 1) = 1 - P\left(\frac{N - \mu}{\sigma} < \frac{1 - \mu}{\sigma}\right) \\
&= 1 - \Phi\left(\frac{1 - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{1}{\sigma}\right).
\end{aligned}$$

In the latter case, the probability of error is the same, by symmetry. The value
of $\Phi(1/\sigma)$ can be obtained from the normal table. For $\sigma = 1$, we have $\Phi(1/\sigma) =
\Phi(1) = 0.8413$, and the probability of the error is 0.1587.

Figure 3.11: The signal detection scheme of Example 3.9. The signal $S = +1$ or $-1$
passes through a noisy channel that adds normal zero-mean noise $N$ with variance
$\sigma^2$; the receiver decides $+1$ if $N + S \ge 0$ and $-1$ if $N + S < 0$. The area of the
shaded region gives the probability of error in the two cases where $-1$ and $+1$
is transmitted.
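The error probability $1 - \Phi(1/\sigma)$ can be compared with a simulation of the channel. The sketch below is illustrative only: it assumes that $-1$ and $+1$ are sent equally often (the text does not specify this, and the error probability is the same in both cases), and the function names, trial count, and seed are arbitrary choices.

```python
import math
import random

def error_probability(sigma):
    """1 - Phi(1/sigma), using the identity 1 - Phi(z) = erfc(z / sqrt(2)) / 2."""
    return 0.5 * math.erfc(1.0 / (sigma * math.sqrt(2.0)))

def simulate_error_rate(sigma, trials=100_000, seed=0):
    """Fraction of transmissions decoded incorrectly in a simulated channel."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        s = rng.choice((-1, 1))                # transmitted signal (assumed equally likely)
        received = s + rng.gauss(0.0, sigma)   # additive zero-mean normal noise
        decoded = 1 if received >= 0 else -1   # receiver's decision rule
        errors += decoded != s
    return errors / trials

print(round(error_probability(1.0), 4))  # 0.1587
print(simulate_error_rate(1.0))          # simulated rate, close to the value above
```

Increasing `sigma` moves the error probability toward 1/2, while a small `sigma` makes errors rare, in line with the formula.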


The normal random variable plays an important role in a broad range of
probabilistic models. The main reason is that, generally speaking, it models well
the additive effect of many independent factors, in a variety of engineering, phys- 
ical, and statistical contexts. Mathematically, the key fact is that the sum of a 
large number of independent and identically distributed (not necessarily normal) 
random variables has an approximately normal CDF, regardless of the CDF of 
the individual random variables. This property is captured in the celebrated 
central limit theorem, which will be discussed in Chapter 7. 


3.4 CONDITIONING ON AN EVENT


The conditional PDF of a continuous random variable $X$, conditioned on a
particular event $A$ with $P(A) > 0$, is a function $f_{X|A}$ that satisfies

$$P(X \in B \mid A) = \int_B f_{X|A}(x)\,dx,$$

for any subset $B$ of the real line. It is the same as an ordinary PDF, except that
it now refers to a new universe in which the event $A$ is known to have occurred.

An important special case arises when we condition on $X$ belonging to a
subset $A$ of the real line, with $P(X \in A) > 0$. We then have

$$P(X \in B \mid X \in A) = \frac{P(X \in B \text{ and } X \in A)}{P(X \in A)} = \frac{\int_{A \cap B} f_X(x)\,dx}{P(X \in A)}.$$

This formula must agree with the earlier one, and therefore,†

$$f_{X|A}(x) = \begin{cases} \dfrac{f_X(x)}{P(X \in A)} & \text{if } x \in A, \\[1.5ex] 0 & \text{otherwise.} \end{cases}$$

As in the discrete case, the conditional PDF is zero outside the conditioning
set. Within the conditioning set, the conditional PDF has exactly the same
shape as the unconditional one, except that it is scaled by the constant factor
$1/P(X \in A)$. This normalization ensures that $f_{X|A}$ integrates to 1, which makes
it a legitimate PDF; see Fig. 3.13.

Figure 3.13: The unconditional PDF $f_X$ and the conditional PDF $f_{X|A}$, where
$A$ is the interval $[a,b]$. Note that within the conditioning event $A$, $f_{X|A}$ retains
the same shape as $f_X$, except that it is scaled along the vertical axis.


Example 3.10. The exponential random variable is memoryless. Alvin
goes to a bus stop where the time $T$ between two successive buses has an exponential
PDF with parameter $\lambda$. Suppose that Alvin arrives $t$ secs after the preceding bus
arrival and let us express this fact with the event $A = \{T > t\}$. Let $X$ be the time
that Alvin has to wait for the next bus to arrive. What is the conditional CDF
$F_{X|A}(x \mid A)$?

† We are using here the simpler notation $f_{X|A}(x)$ in place of $f_{X|X \in A}(x)$, which is
more accurate.

We have

$$\begin{aligned}
P(X > x \mid A) &= P(T > t + x \mid T > t) \\
&= \frac{P(T > t + x \text{ and } T > t)}{P(T > t)} \\
&= \frac{P(T > t + x)}{P(T > t)} \\
&= \frac{e^{-\lambda(t+x)}}{e^{-\lambda t}} \\
&= e^{-\lambda x},
\end{aligned}$$

where we have used the expression for the CDF of an exponential random variable
derived in Example 3.6.

Thus, the conditional CDF of $X$ is exponential with parameter $\lambda$, regardless of
the time $t$ that elapsed between the preceding bus arrival and Alvin’s arrival. This
is known as the memorylessness property of the exponential. Generally, if we model
the time to complete a certain operation by an exponential random variable $X$,
this property implies that as long as the operation has not been completed, the
remaining time up to completion has the same exponential CDF, no matter when
the operation started.
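The memorylessness property can be checked numerically from the definition of conditional probability. In the sketch below, the rate `lam` and the elapsed times are illustrative assumptions, and the function names are hypothetical.

```python
import math

def survival(a, lam):
    """P(T > a) = exp(-lam * a) for a >= 0."""
    return math.exp(-lam * a)

def conditional_survival(x, t, lam):
    """P(T > t + x | T > t), computed as P(T > t + x) / P(T > t)."""
    return survival(t + x, lam) / survival(t, lam)

lam = 0.5  # illustrative rate, not from the text
for t in (0.0, 1.0, 7.5):
    # The conditional value does not depend on t and matches P(T > 2).
    print(t, conditional_survival(2.0, t, lam), survival(2.0, lam))
```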


For a continuous random variable, the conditional expectation is defined
similarly to the unconditional case, except that we now need to use the conditional
PDF. We summarize the discussion so far, together with some additional
properties, in the table that follows.


Conditional PDF and Expectation Given an Event

• The conditional PDF $f_{X|A}$ of a continuous random variable $X$ given
an event $A$ with $P(A) > 0$, satisfies

$$P(X \in B \mid A) = \int_B f_{X|A}(x)\,dx.$$

• If $A$ is a subset of the real line with $P(X \in A) > 0$, then

$$f_{X|A}(x) = \begin{cases} \dfrac{f_X(x)}{P(X \in A)} & \text{if } x \in A, \\[1.5ex] 0 & \text{otherwise,} \end{cases}$$

and

$$P(X \in B \mid X \in A) = \int_B f_{X|A}(x)\,dx,$$

for any set $B$.

• The corresponding conditional expectation is defined by

$$E[X \mid A] = \int_{-\infty}^{\infty} x f_{X|A}(x)\,dx.$$

• The expected value rule remains valid:

$$E[g(X) \mid A] = \int_{-\infty}^{\infty} g(x) f_{X|A}(x)\,dx.$$

• If $A_1, A_2, \ldots, A_n$ are disjoint events with $P(A_i) > 0$ for each $i$, that
form a partition of the sample space, then

$$f_X(x) = \sum_{i=1}^{n} P(A_i) f_{X|A_i}(x)$$

(a version of the total probability theorem), and

$$E[X] = \sum_{i=1}^{n} P(A_i) E[X \mid A_i]$$

(the total expectation theorem). Similarly,

$$E[g(X)] = \sum_{i=1}^{n} P(A_i) E[g(X) \mid A_i].$$

To justify the above version of the total probability theorem, we use the
total probability theorem from Chapter 1, to obtain

$$P(X \le x) = \sum_{i=1}^{n} P(A_i) P(X \le x \mid A_i).$$

This formula can be rewritten as

$$\int_{-\infty}^{x} f_X(t)\,dt = \sum_{i=1}^{n} P(A_i) \int_{-\infty}^{x} f_{X|A_i}(t)\,dt.$$

We take the derivative of both sides, with respect to $x$, and obtain the desired
relation

$$f_X(x) = \sum_{i=1}^{n} P(A_i) f_{X|A_i}(x).$$

If we now multiply both sides by $x$ and then integrate from $-\infty$ to $\infty$, we obtain
the total expectation theorem for continuous random variables.

The total expectation theorem can often facilitate the calculation of the 
mean, variance, and other moments of a random variable, using a divide-and- 
conquer approach. 


Example 3.11. Mean and Variance of a Piecewise Constant PDF. Suppose
that the random variable $X$ has the piecewise constant PDF

$$f_X(x) = \begin{cases} 1/3 & \text{if } 0 \le x \le 1, \\ 2/3 & \text{if } 1 < x \le 2, \\ 0 & \text{otherwise,} \end{cases}$$

(see Fig. 3.14). Consider the events

$$A_1 = \{X \text{ lies in the first interval } [0,1]\},$$
$$A_2 = \{X \text{ lies in the second interval } (1,2]\}.$$

We have from the given PDF,

$$P(A_1) = \int_0^1 f_X(x)\,dx = \frac{1}{3}, \qquad P(A_2) = \int_1^2 f_X(x)\,dx = \frac{2}{3}.$$

Furthermore, the conditional mean and second moment of $X$, conditioned on $A_1$
and $A_2$, are easily calculated since the corresponding conditional PDFs $f_{X|A_1}$ and
$f_{X|A_2}$ are uniform. We recall from Example 3.4 that the mean of a uniform random
variable on an interval $[a,b]$ is $(a+b)/2$ and its second moment is $(a^2 + ab + b^2)/3$.
Thus,

$$E[X \mid A_1] = \frac{1}{2}, \qquad E[X \mid A_2] = \frac{3}{2},$$

$$E[X^2 \mid A_1] = \frac{1}{3}, \qquad E[X^2 \mid A_2] = \frac{7}{3}.$$

Figure 3.14: Piecewise constant PDF for Example 3.11.

We now use the total expectation theorem to obtain

$$E[X] = P(A_1) E[X \mid A_1] + P(A_2) E[X \mid A_2] = \frac{1}{3} \cdot \frac{1}{2} + \frac{2}{3} \cdot \frac{3}{2} = \frac{7}{6},$$

$$E[X^2] = P(A_1) E[X^2 \mid A_1] + P(A_2) E[X^2 \mid A_2] = \frac{1}{3} \cdot \frac{1}{3} + \frac{2}{3} \cdot \frac{7}{3} = \frac{15}{9}.$$

The variance is given by

$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = \frac{15}{9} - \frac{49}{36} = \frac{11}{36}.$$

Note that this approach to the mean and variance calculation is easily generalized
to piecewise constant PDFs with more than two pieces.
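The divide-and-conquer calculation of this example extends directly to any number of pieces. The following is a minimal sketch with exact fractions; the list `pieces` and the helper names are hypothetical, and the uniform-moment formulas are the ones recalled from Example 3.4.

```python
from fractions import Fraction as F

def uniform_mean(a, b):
    """Mean of a uniform random variable on [a, b]."""
    return (a + b) / 2

def uniform_second_moment(a, b):
    """Second moment of a uniform random variable on [a, b]."""
    return (a * a + a * b + b * b) / 3

# (probability of the piece, left endpoint, right endpoint) for Example 3.11
pieces = [(F(1, 3), F(0), F(1)), (F(2, 3), F(1), F(2))]

# Total expectation theorem: weight each conditional moment by P(A_i).
mean = sum(p * uniform_mean(a, b) for p, a, b in pieces)
second = sum(p * uniform_second_moment(a, b) for p, a, b in pieces)
variance = second - mean ** 2

print(mean, second, variance)  # 7/6 5/3 11/36
```

Note that `5/3` is the reduced form of the value 15/9 obtained above. Adding more tuples to `pieces` handles PDFs with more than two pieces.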


The next example illustrates a divide-and-conquer approach that uses the 
total probability theorem to calculate a PDF. 


Example 3.12. The metro train arrives at the station near your home every 
quarter hour starting at 6:00 AM. You walk into the station every morning between 
7:10 and 7:30 AM, with the time in this interval being a uniform random variable. 
What is the PDF of the time you have to wait for the first train to arrive? 


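Figure 3.15: The PDFs $f_X$, $f_{Y|A}$, $f_{Y|B}$, and $f_Y$ in Example 3.12: (a) $f_X(x)$;
(b) $f_{Y|A}(y)$, equal to 1/5 on $[0,5]$; (c) $f_{Y|B}(y)$, equal to 1/15 on $[0,15]$;
(d) $f_Y(y)$, equal to 1/10 on $[0,5]$ and 1/20 on $(5,15]$.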


The time of your arrival, denoted by $X$, is a uniform random variable on the
interval from 7:10 to 7:30; see Fig. 3.15(a). Let $Y$ be the waiting time. We calculate
the PDF $f_Y$ using a divide-and-conquer strategy. Let $A$ and $B$ be the events

$$A = \{7{:}10 \le X \le 7{:}15\} = \{\text{you board the 7:15 train}\},$$

$$B = \{7{:}15 < X \le 7{:}30\} = \{\text{you board the 7:30 train}\}.$$

Conditioned on the event $A$, your arrival time is uniform on the interval from 7:10
to 7:15. In that case, the waiting time $Y$ is also uniform and takes values between
0 and 5 minutes; see Fig. 3.15(b). Similarly, conditioned on $B$, $Y$ is uniform and
takes values between 0 and 15 minutes; see Fig. 3.15(c). The PDF of $Y$ is obtained
using the total probability theorem,

$$f_Y(y) = P(A) f_{Y|A}(y) + P(B) f_{Y|B}(y),$$

and is shown in Fig. 3.15(d). In particular,

$$f_Y(y) = \frac{1}{4} \cdot \frac{1}{5} + \frac{3}{4} \cdot \frac{1}{15} = \frac{1}{10}, \qquad \text{for } 0 \le y \le 5,$$

and

$$f_Y(y) = \frac{1}{4} \cdot 0 + \frac{3}{4} \cdot \frac{1}{15} = \frac{1}{20}, \qquad \text{for } 5 < y \le 15.$$
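The mixture formula $f_Y(y) = P(A) f_{Y|A}(y) + P(B) f_{Y|B}(y)$ of this example can be written out directly. The sketch below uses exact fractions and measures the waiting time in minutes; the function names are hypothetical.

```python
from fractions import Fraction as F

def uniform_pdf(y, lo, hi):
    """PDF of a uniform random variable on [lo, hi], evaluated at y."""
    return F(1, hi - lo) if lo <= y <= hi else F(0)

def waiting_time_pdf(y):
    """f_Y(y) obtained from the total probability theorem."""
    p_a = F(5, 20)    # arrive between 7:10 and 7:15, board the 7:15 train
    p_b = F(15, 20)   # arrive between 7:15 and 7:30, board the 7:30 train
    return p_a * uniform_pdf(y, 0, 5) + p_b * uniform_pdf(y, 0, 15)

print(waiting_time_pdf(3))   # 1/10
print(waiting_time_pdf(10))  # 1/20
```

As a consistency check, the area under this PDF is $5 \cdot \tfrac{1}{10} + 10 \cdot \tfrac{1}{20} = 1$.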


3.5 MULTIPLE CONTINUOUS RANDOM VARIABLES 


We will now extend the notion of a PDF to the case of multiple random vari- 
ables. In complete analogy with discrete random variables, we introduce joint, 
marginal, and conditional PDFs. Their intuitive interpretation as well as their 
main properties parallel the discrete case. 

We say that two continuous random variables associated with a common
experiment are jointly continuous and can be described in terms of a joint
PDF $f_{X,Y}$, if $f_{X,Y}$ is a nonnegative function that satisfies

$$P\big((X,Y) \in B\big) = \iint_{(x,y) \in B} f_{X,Y}(x,y)\,dx\,dy,$$

for every subset $B$ of the two-dimensional plane. The notation above means
that the integration is carried over the set $B$. In the particular case where $B$ is
a rectangle of the form $B = [a,b] \times [c,d]$, we have

$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_{X,Y}(x,y)\,dx\,dy.$$

Furthermore, by letting $B$ be the entire two-dimensional plane, we obtain the
normalization property

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = 1.$$

To interpret the PDF, we let $\delta$ be very small and consider the probability
of a small rectangle. We have

$$P(a \le X \le a + \delta,\ c \le Y \le c + \delta) = \int_c^{c+\delta} \int_a^{a+\delta} f_{X,Y}(x,y)\,dx\,dy \approx f_{X,Y}(a,c) \cdot \delta^2,$$

so we can view $f_{X,Y}(a,c)$ as the “probability per unit area” in the vicinity of
$(a,c)$.

The joint PDF contains all conceivable probabilistic information on the 
random variables X and Y, as well as their dependencies. It allows us to calculate 
the probability of any event that can be defined in terms of these two random 
variables. As a special case, it can be used to calculate the probability of an 
event involving only one of them. For example, let $A$ be a subset of the real line
and consider the event $\{X \in A\}$. We have

$$P(X \in A) = P\big(X \in A \text{ and } Y \in (-\infty, \infty)\big) = \int_A \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\,dx.$$

Comparing with the formula

$$P(X \in A) = \int_A f_X(x)\,dx,$$

we see that the marginal PDF $f_X$ of $X$ is given by

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy.$$

Similarly,

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$
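The marginalization formula can be illustrated numerically. The joint PDF used below is a hypothetical one, chosen only for the illustration and not taken from the text: it is uniform on the triangle $0 \le y \le x \le 1$, which has area 1/2, so the joint PDF equals 2 there and the marginal should come out as $f_X(x) = 2x$ for $0 \le x \le 1$.

```python
def joint_pdf(x, y):
    """Hypothetical joint PDF: uniform on the triangle 0 <= y <= x <= 1."""
    return 2.0 if 0.0 <= y <= x <= 1.0 else 0.0

def marginal_x(x, steps=20_000):
    """Approximate f_X(x) by integrating the joint PDF over y (midpoint rule).

    The joint PDF vanishes outside 0 <= y <= 1, so integrating over [0, 1]
    is enough for this particular example.
    """
    h = 1.0 / steps
    return sum(joint_pdf(x, (j + 0.5) * h) for j in range(steps)) * h

for x in (0.25, 0.5, 0.9):
    print(x, marginal_x(x))  # each value is close to 2 * x
```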


Example 3.13. Two-Dimensional Uniform PDF. Romeo and Juliet have a 
date at a given time, and each will arrive at the meeting place with a delay between 
0 and 1 hour (recall the example given in Section 1.2). Let X and Y denote the 
delays of Romeo and Juliet, respectively. Assuming that no pairs (x,y) in the 
square [0,1] x [0,1] are more likely than others, a natural model involves a joint 
PDF of the form

$$f_{X,Y}(x,y) = \begin{cases} c & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1, \\ 0 & \text{otherwise,} \end{cases}$$

where c is a constant. For this PDF to satisfy the normalization property

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx\, dy = \int_0^1 \int_0^1 c\, dx\, dy = 1,$$

we must have

$$c = 1.$$

This is an example of a uniform PDF on the unit square. More generally,
let us fix some subset S of the two-dimensional plane. The corresponding uniform
joint PDF on S is defined to be

$$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\text{area of } S} & \text{if } (x,y) \in S, \\ 0 & \text{otherwise.} \end{cases}$$

For any set $A \subset S$, the probability that the experimental value of (X,Y) lies in A
is

$$P\big((X,Y) \in A\big) = \iint_{(x,y) \in A} f_{X,Y}(x,y)\, dx\, dy = \frac{1}{\text{area of } S} \iint_{(x,y) \in A \cap S} dx\, dy = \frac{\text{area of } A \cap S}{\text{area of } S}.$$


Example 3.14. We are told that the joint PDF of the random variables X and Y
is a constant c on the set S shown in Fig. 3.16 and is zero outside. Find the value
of c and the marginal PDFs of X and Y.

The area of the set S is equal to 4 and, therefore, $f_{X,Y}(x,y) = c = 1/4$, for
$(x,y) \in S$. To find the marginal PDF $f_X(x)$ for some particular x, we integrate
(with respect to y) the joint PDF over the vertical line corresponding to that x.
The resulting PDF is shown in the figure. We can compute $f_Y$ similarly.

Figure 3.16: The joint PDF in Example 3.14 and the resulting marginal
PDFs.

Example 3.15. Buffon’s Needle. This is a famous example, which marks
the origin of the subject of geometrical probability, that is, the analysis of the
geometrical configuration of randomly placed objects.

A surface is ruled with parallel lines, which are at distance d from each other
(see Fig. 3.17). Suppose that we throw a needle of length l on the surface at random.
What is the probability that the needle will intersect one of the lines?

Figure 3.17: Buffon’s needle. The length of the line segment between the
midpoint of the needle and the point of intersection of the axis of the needle
with the closest parallel line is $x/\sin\theta$. The needle will intersect the closest
parallel line if and only if this length is less than $l/2$.

We assume here that $l < d$, so that the needle cannot intersect two lines
simultaneously. Let X be the distance from the midpoint of the needle to the
nearest of the parallel lines, and let $\Theta$ be the acute angle formed by the axis of the
needle and the parallel lines (see Fig. 3.17). We model the pair of random variables
$(X, \Theta)$ with a uniform joint PDF over the rectangle $[0, d/2] \times [0, \pi/2]$, so that

$$f_{X,\Theta}(x,\theta) = \begin{cases} 4/(\pi d) & \text{if } x \in [0, d/2] \text{ and } \theta \in [0, \pi/2], \\ 0 & \text{otherwise.} \end{cases}$$

As can be seen from Fig. 3.17, the needle will intersect one of the lines if and
only if

$$X \le \frac{l}{2} \sin\Theta,$$

so the probability of intersection is

$$\begin{aligned}
P\big(X \le (l/2)\sin\Theta\big) &= \iint_{x \le (l/2)\sin\theta} f_{X,\Theta}(x,\theta)\, dx\, d\theta \\
&= \frac{4}{\pi d} \int_0^{\pi/2} \int_0^{(l/2)\sin\theta} dx\, d\theta \\
&= \frac{4}{\pi d} \int_0^{\pi/2} \frac{l}{2} \sin\theta\, d\theta \\
&= \frac{2l}{\pi d} (-\cos\theta) \Big|_0^{\pi/2} \\
&= \frac{2l}{\pi d}.
\end{aligned}$$

The probability of intersection can be empirically estimated, by repeating the
experiment a large number of times. Since it is equal to $2l/\pi d$, this provides us with
a method for the experimental evaluation of $\pi$.
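
The experimental evaluation of $\pi$ mentioned above can be imitated on a computer. The following is a minimal simulation sketch, assuming the uniform model for $(X, \Theta)$ used in this example; the function name `buffon_estimate`, the seed, and the values l = 1, d = 2 are illustrative choices, not part of the notes.

```python
import math
import random


def buffon_estimate(l, d, trials, seed=0):
    """Estimate the probability that a needle of length l crosses a line (requires l < d)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, d / 2)            # distance from the midpoint to the nearest line
        theta = rng.uniform(0.0, math.pi / 2)  # acute angle between the needle and the lines
        if x <= (l / 2) * math.sin(theta):     # intersection condition X <= (l/2) sin(Theta)
            hits += 1
    return hits / trials


l, d = 1.0, 2.0
p_hat = buffon_estimate(l, d, trials=200_000)
p_exact = 2 * l / (math.pi * d)   # the value 2l/(pi d) derived in the example
pi_hat = 2 * l / (p_hat * d)      # solve p = 2l/(pi d) for pi
print(p_hat, p_exact, pi_hat)
```

The estimate of $\pi$ converges slowly: the error shrinks roughly like one over the square root of the number of throws, so this is a conceptual illustration rather than a practical way of computing $\pi$.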


Expectation


If X and Y are jointly continuous random variables, and g is some function, then
Z = g(X,Y) is also a random variable. We will see in Section 3.6 methods for
computing the PDF of Z, if it has one. For now, let us note that the expected
value rule is still applicable and

$$E\big[g(X,Y)\big] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{X,Y}(x,y)\, dx\, dy.$$

As an important special case, for any scalars a, b, we have

$$E[aX + bY] = aE[X] + bE[Y].$$


Conditioning One Random Variable on Another 


Let X and Y be continuous random variables with joint PDF $f_{X,Y}$. For any
fixed y with $f_Y(y) > 0$, the conditional PDF of X given that Y = y is defined
by

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

This definition is analogous to the formula $p_{X|Y} = p_{X,Y}/p_Y$ for the discrete case.

When thinking about the conditional PDF, it is best to view y as a fixed
number and consider $f_{X|Y}(x \mid y)$ as a function of the single variable x. As a
function of x, the conditional PDF $f_{X|Y}(x \mid y)$ has the same shape as the joint
PDF $f_{X,Y}(x,y)$, because the normalizing factor $f_Y(y)$ does not depend on x; see
Fig. 3.18. Note that the normalization ensures that

$$\int_{-\infty}^{\infty} f_{X|Y}(x \mid y)\, dx = 1,$$


so for any fixed y, $f_{X|Y}(x \mid y)$ is a legitimate PDF.


Figure 3.18: Visualization of the conditional PDF $f_{X|Y}(x \mid y)$. Let X, Y have a
joint PDF which is uniform on the set S. For each fixed y, we consider the joint
PDF along the slice Y = y and normalize it so that it integrates to 1.

Example 3.16. Circular Uniform PDF. John throws a dart at a circular
target of radius r (see Fig. 3.19). We assume that he always hits the target, and
that all points of impact (x,y) are equally likely, so that the joint PDF of the
random variables X and Y is uniform. Following Example 3.13, and since the area
of the circle is $\pi r^2$, we have

$$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\text{area of the circle}} & \text{if } (x,y) \text{ is in the circle,} \\ 0 & \text{otherwise,} \end{cases}
= \begin{cases} \dfrac{1}{\pi r^2} & \text{if } x^2 + y^2 \le r^2, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 3.19: Circular target for Example 3.16.

To calculate the conditional PDF $f_{X|Y}(x \mid y)$, let us first calculate the marginal
PDF $f_Y(y)$. For $|y| > r$, it is zero. For $|y| \le r$, it can be calculated as follows:

$$\begin{aligned}
f_Y(y) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx \\
&= \frac{1}{\pi r^2} \int_{x^2 + y^2 \le r^2} dx \\
&= \frac{1}{\pi r^2} \int_{-\sqrt{r^2 - y^2}}^{\sqrt{r^2 - y^2}} dx \\
&= \frac{2}{\pi r^2} \sqrt{r^2 - y^2}.
\end{aligned}$$

Note that the marginal $f_Y(y)$ is not a uniform PDF.

The conditional PDF is

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{\dfrac{1}{\pi r^2}}{\dfrac{2}{\pi r^2}\sqrt{r^2 - y^2}} = \frac{1}{2\sqrt{r^2 - y^2}}, \qquad \text{if } x^2 + y^2 \le r^2.$$

Thus, for a fixed value of y, the conditional PDF $f_{X|Y}$ is uniform.


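To interpret the conditional PDF, let us fix some small positive numbers
$\delta_1$ and $\delta_2$, and condition on the event $B = \{y \le Y \le y + \delta_2\}$. We have

$$\begin{aligned}
P(x \le X \le x + \delta_1 \mid y \le Y \le y + \delta_2)
&= \frac{P(x \le X \le x + \delta_1 \text{ and } y \le Y \le y + \delta_2)}{P(y \le Y \le y + \delta_2)} \\
&\approx \frac{f_{X,Y}(x,y)\,\delta_1 \delta_2}{f_Y(y)\,\delta_2} = f_{X|Y}(x \mid y)\,\delta_1.
\end{aligned}$$

In words, $f_{X|Y}(x \mid y)\,\delta_1$ provides us with the probability that X belongs in a
small interval $[x, x + \delta_1]$, given that Y belongs in a small interval $[y, y + \delta_2]$.
Since $f_{X|Y}(x \mid y)\,\delta_1$ does not depend on $\delta_2$, we can think of the limiting case
where $\delta_2$ decreases to zero and write

$$P(x \le X \le x + \delta_1 \mid Y = y) \approx f_{X|Y}(x \mid y)\,\delta_1, \qquad (\delta_1 \text{ small}),$$

and, more generally,

$$P(X \in A \mid Y = y) = \int_A f_{X|Y}(x \mid y)\, dx.$$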


Conditional probabilities, given the zero probability event {Y = y}, were left
undefined in Chapter 1. But the above formula provides a natural way of defining
such conditional probabilities in the present context. In addition, it allows us to
view the conditional PDF $f_{X|Y}(x \mid y)$ (as a function of x) as a description of the
probability law of X, given that the event {Y = y} has occurred.

As in the discrete case, the conditional PDF $f_{X|Y}$, together with the
marginal PDF $f_Y$, is sometimes used to calculate the joint PDF. Furthermore,
this approach can also be used for modeling: instead of directly specifying $f_{X,Y}$,
it is often natural to provide a probability law for Y, in terms of a PDF $f_Y$, and
then provide a conditional probability law $f_{X|Y}(x \mid y)$ for X, given any possible
value y of Y.


Example 3.17. Let X be exponentially distributed with mean 1. Once we
observe the experimental value x of X, we generate a normal random variable Y
with zero mean and variance x + 1. What is the joint PDF of X and Y?

We have $f_X(x) = e^{-x}$, for $x \ge 0$, and

$$f_{Y|X}(y \mid x) = \frac{1}{\sqrt{2\pi(x+1)}}\, e^{-y^2/2(x+1)}.$$

Thus,

$$f_{X,Y}(x,y) = f_X(x) f_{Y|X}(y \mid x) = e^{-x}\, \frac{1}{\sqrt{2\pi(x+1)}}\, e^{-y^2/2(x+1)},$$

for all $x \ge 0$ and all y.


Having defined a conditional probability law, we can also define a
corresponding conditional expectation by letting

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\, dx.$$

The properties of (unconditional) expectation carry through, with the obvious
modifications, to conditional expectation. For example, the conditional version
of the expected value rule

$$E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x) f_{X|Y}(x \mid y)\, dx$$

remains valid.


Summary of Facts About Multiple Continuous Random Variables

Let X and Y be jointly continuous random variables with joint PDF $f_{X,Y}$.

• The joint, marginal, and conditional PDFs are related to each other
by the formulas

$$f_{X,Y}(x,y) = f_Y(y) f_{X|Y}(x \mid y),$$
$$f_X(x) = \int_{-\infty}^{\infty} f_Y(y) f_{X|Y}(x \mid y)\, dy.$$

The conditional PDF $f_{X|Y}(x \mid y)$ is defined only for those y for which
$f_Y(y) > 0$.

• They can be used to calculate probabilities:

$$P\big((X,Y) \in B\big) = \iint_{(x,y) \in B} f_{X,Y}(x,y)\, dx\, dy,$$
$$P(X \in A) = \int_A f_X(x)\, dx,$$
$$P(X \in A \mid Y = y) = \int_A f_{X|Y}(x \mid y)\, dx.$$

• They can also be used to calculate expectations:

$$E[g(X)] = \int g(x) f_X(x)\, dx,$$
$$E[g(X,Y)] = \iint g(x,y) f_{X,Y}(x,y)\, dx\, dy,$$
$$E[g(X) \mid Y = y] = \int g(x) f_{X|Y}(x \mid y)\, dx,$$
$$E[g(X,Y) \mid Y = y] = \int g(x,y) f_{X|Y}(x \mid y)\, dx.$$

• We have the following versions of the total expectation theorem:

$$E[X] = \int E[X \mid Y = y] f_Y(y)\, dy,$$
$$E[g(X)] = \int E[g(X) \mid Y = y] f_Y(y)\, dy,$$
$$E[g(X,Y)] = \int E[g(X,Y) \mid Y = y] f_Y(y)\, dy.$$


To justify the first version of the total expectation theorem, we observe
that

$$\begin{aligned}
\int E[X \mid Y = y] f_Y(y)\, dy &= \int \left[ \int x f_{X|Y}(x \mid y)\, dx \right] f_Y(y)\, dy \\
&= \iint x f_{X|Y}(x \mid y) f_Y(y)\, dx\, dy \\
&= \iint x f_{X,Y}(x,y)\, dx\, dy \\
&= \int x \left[ \int f_{X,Y}(x,y)\, dy \right] dx \\
&= \int x f_X(x)\, dx \\
&= E[X].
\end{aligned}$$

The other two versions are justified similarly.
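
The first version of the theorem can be illustrated numerically. The model below is a hypothetical one chosen only for this sketch and does not appear in the notes: Y is uniform on [0,1] and, given Y = y, X is uniform on [0,y], so that E[X | Y = y] = y/2 and the theorem gives E[X] = 1/4. The code evaluates the integral with the midpoint rule and compares it with a direct simulation of X.

```python
import random


def total_expectation(cond_mean, f_y, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of E[X | Y = y] f_Y(y) dy over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += cond_mean(y) * f_y(y)
    return total * h


# Hypothetical model: Y uniform on [0, 1]; given Y = y, X uniform on [0, y].
via_theorem = total_expectation(lambda y: y / 2, lambda y: 1.0, 0.0, 1.0)

# Direct simulation of X under the same model.
rng = random.Random(1)
samples = 200_000
acc = 0.0
for _ in range(samples):
    y = rng.random()
    acc += rng.uniform(0.0, y)
via_simulation = acc / samples
print(via_theorem, via_simulation)
```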


Inference and the Continuous Bayes’ Rule 


In many situations, we have a model of an underlying but unobserved
phenomenon, represented by a random variable X with PDF $f_X$, and we make
noisy measurements Y. The measurements are supposed to provide information
about X and are modeled in terms of a conditional PDF $f_{Y|X}$. For example, if
Y is the same as X, but corrupted by zero-mean normally distributed noise, one
would let the conditional PDF $f_{Y|X}(y \mid x)$ of Y, given that X = x, be normal
with mean equal to x. Once the experimental value of Y is measured, what
information does this provide on the unknown value of X?

This setting is similar to that encountered in Section 1.4, when we
introduced Bayes’ rule and used it to solve inference problems. The only difference
is that we are now dealing with continuous random variables.

Note that the information provided by the event {Y = y} is described by
the conditional PDF $f_{X|Y}(x \mid y)$. It thus suffices to evaluate the latter PDF. A
calculation analogous to the original derivation of Bayes’ rule, based on the
formulas $f_X f_{Y|X} = f_{X,Y} = f_Y f_{X|Y}$, yields

$$f_{X|Y}(x \mid y) = \frac{f_X(x) f_{Y|X}(y \mid x)}{f_Y(y)} = \frac{f_X(x) f_{Y|X}(y \mid x)}{\displaystyle\int f_X(t) f_{Y|X}(y \mid t)\, dt},$$

which is the desired formula.


Example 3.18. A lightbulb produced by the General Illumination Company is
known to have an exponentially distributed lifetime Y. However, the company has
been experiencing quality control problems. On any given day, the parameter X of
the PDF of Y is actually a random variable, uniformly distributed in the interval
[0,1/2]. We test a lightbulb and record the experimental value y of its lifetime.
What can we say about the underlying parameter?

We model the parameter as a random variable X, with a uniform
distribution. All available information about X is contained in the conditional PDF
$f_{X|Y}(x \mid y)$. We view y as a constant (equal to the observed value of Y) and
concentrate on the dependence of the PDF on x. Note that $f_X(x) = 2$, for $0 \le x \le 1/2$,
and that, given X = x, the lifetime Y is exponential with parameter x, so that
$f_{Y|X}(y \mid x) = x e^{-xy}$ for $y \ge 0$. By the continuous Bayes’ rule, we have

$$f_{X|Y}(x \mid y) = \frac{2x e^{-xy}}{\displaystyle\int_0^{1/2} 2t e^{-ty}\, dt}, \qquad \text{for } 0 \le x \le \frac{1}{2}.$$
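
The posterior PDF above has no particularly simple closed form, but it is easy to evaluate numerically. The sketch below is one possible way of doing so and is not part of the notes: the helper names `normalizer` and `posterior` are hypothetical, and the denominator is approximated with the midpoint rule.

```python
import math


def normalizer(y, n=20_000):
    """Midpoint-rule value of the denominator: integral over [0, 1/2] of 2 t e^{-t y} dt."""
    h = 0.5 / n
    return sum(2 * t * math.exp(-t * y) for t in ((k + 0.5) * h for k in range(n))) * h


def posterior(x, y, z):
    """f_{X|Y}(x | y) for the lightbulb model, given the precomputed normalizer z."""
    if not 0.0 <= x <= 0.5:
        return 0.0
    return 2 * x * math.exp(-x * y) / z


y_obs = 3.0               # hypothetical observed lifetime, for illustration only
z = normalizer(y_obs)
print(posterior(0.1, y_obs, z), posterior(0.4, y_obs, z))
```

For an observed lifetime of zero the exponential factor disappears, and the posterior reduces to 8x on [0,1/2], which provides a quick sanity check of the code.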
In some cases, the unobserved phenomenon is inherently discrete. For
example, a binary signal may be observed in the presence of noise with a normal
distribution, or a medical diagnosis may have to be made on the basis of continuous
measurements like temperature and blood counts. In such cases, a somewhat
different version of Bayes’ rule applies.

Let X be a discrete random variable that takes values in a finite set
{1,...,n} and which represents the different discrete possibilities for the
unobserved phenomenon of interest. The PMF $p_X$ of X is assumed to be known.
Let Y be a continuous random variable which, for any given value x, is described
by a conditional PDF $f_{Y|X}(y \mid x)$. We are interested in the conditional PMF of
X given the experimental value y of Y.

Instead of working with the conditioning event {Y = y}, which has zero
probability, let us instead condition on the event $\{y \le Y \le y + \delta\}$, where $\delta$ is a small
positive number, and then take the limit as $\delta$ tends to zero. We have, using
Bayes’ rule,

$$\begin{aligned}
P(X = x \mid Y = y) &\approx P(X = x \mid y \le Y \le y + \delta) \\
&= \frac{p_X(x)\, P(y \le Y \le y + \delta \mid X = x)}{P(y \le Y \le y + \delta)} \\
&\approx \frac{p_X(x) f_{Y|X}(y \mid x)\,\delta}{f_Y(y)\,\delta} \\
&= \frac{p_X(x) f_{Y|X}(y \mid x)}{f_Y(y)}.
\end{aligned}$$

The denominator can be evaluated using a version of the total probability
theorem introduced in Section 3.4. We have

$$f_Y(y) = \sum_{i=1}^{n} p_X(i) f_{Y|X}(y \mid i).$$


Example 3.19. Let us revisit the signal detection problem considered in Example 3.9. A
signal S is transmitted and we are given that P(S = 1) = p and P(S = −1) = 1 − p.
The received signal is Y = N + S, where N is zero mean normal noise, with variance
$\sigma^2$, independent of S. What is the probability that S = 1, as a function of the
observed value y of Y?

Conditioned on S = s, the random variable Y has a normal distribution with
mean s and variance $\sigma^2$. Applying the formula developed above, we obtain

$$P(S = 1 \mid Y = y) = \frac{p_S(1) f_{Y|S}(y \mid 1)}{f_Y(y)}
= \frac{\dfrac{p}{\sqrt{2\pi}\,\sigma}\, e^{-(y-1)^2/2\sigma^2}}{\dfrac{p}{\sqrt{2\pi}\,\sigma}\, e^{-(y-1)^2/2\sigma^2} + \dfrac{1-p}{\sqrt{2\pi}\,\sigma}\, e^{-(y+1)^2/2\sigma^2}}.$$
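
As a small illustration of this formula, the sketch below evaluates the posterior probability that S = 1 for a given observation. The function name `prob_s_is_one` and the numerical values are hypothetical; the common factor $1/(\sqrt{2\pi}\,\sigma)$ cancels between numerator and denominator, so it is omitted in the code.

```python
import math


def prob_s_is_one(y, p, sigma):
    """P(S = 1 | Y = y) for Y = S + N, with N normal, zero mean, variance sigma^2."""
    w_plus = p * math.exp(-((y - 1) ** 2) / (2 * sigma ** 2))         # weight of S = +1
    w_minus = (1 - p) * math.exp(-((y + 1) ** 2) / (2 * sigma ** 2))  # weight of S = -1
    return w_plus / (w_plus + w_minus)


# With equal priors, an observation of y = 0 is uninformative.
print(prob_s_is_one(0.0, 0.5, 1.0))
# A positive observation favors S = 1.
print(prob_s_is_one(0.8, 0.5, 1.0))
```

Dividing numerator and denominator by the first weight shows that the same quantity can be written as $1/\big(1 + \frac{1-p}{p} e^{-2y/\sigma^2}\big)$, which makes it clear that the posterior probability increases with y.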
Independence


In full analogy with the discrete case, we say that two continuous random
variables X and Y are independent if their joint PDF is the product of the marginal
PDFs:

$$f_{X,Y}(x,y) = f_X(x) f_Y(y), \qquad \text{for all } x, y.$$

Comparing with the formula $f_{X,Y}(x,y) = f_{X|Y}(x \mid y) f_Y(y)$, we see that
independence is the same as the condition

$$f_{X|Y}(x \mid y) = f_X(x), \qquad \text{for all } x \text{ and all } y \text{ with } f_Y(y) > 0,$$

or, symmetrically,

$$f_{Y|X}(y \mid x) = f_Y(y), \qquad \text{for all } y \text{ and all } x \text{ with } f_X(x) > 0.$$

If X and Y are independent, then any two events of the form $\{X \in A\}$ and
$\{Y \in B\}$ are independent. Indeed,

$$\begin{aligned}
P(X \in A \text{ and } Y \in B) &= \int_{x \in A} \int_{y \in B} f_{X,Y}(x,y)\, dy\, dx \\
&= \int_{x \in A} \int_{y \in B} f_X(x) f_Y(y)\, dy\, dx \\
&= \int_{x \in A} f_X(x)\, dx \int_{y \in B} f_Y(y)\, dy \\
&= P(X \in A) P(Y \in B).
\end{aligned}$$

A converse statement is also true; see the theoretical problems.
A calculation similar to the discrete case shows that if X and Y are
independent, then

$$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)],$$

for any two functions g and h. Finally, the variance of the sum of independent
random variables is again equal to the sum of the variances.




Independence of Continuous Random Variables

Suppose that X and Y are independent, that is,

$$f_{X,Y}(x,y) = f_X(x) f_Y(y), \qquad \text{for all } x, y.$$

We then have the following properties.

• The random variables g(X) and h(Y) are independent, for any
functions g and h.

• We have

$$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)],$$

for any functions g and h.

• We have

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).$$


Joint CDFs 


If X and Y are two random variables associated with the same experiment, we
define their joint CDF by

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y).$$

As in the case of one random variable, the advantage of working with the CDF
is that it applies equally well to discrete and continuous random variables. In
particular, if X and Y are described by a joint PDF $f_{X,Y}$, then

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(s,t)\, dt\, ds.$$

Conversely, the PDF can be recovered from the CDF by differentiating:

$$f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}}{\partial x\, \partial y}(x,y).$$

Example 3.20. Let X and Y be described by a uniform PDF on the unit square.
The joint CDF is given by

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = xy, \qquad \text{for } 0 \le x, y \le 1.$$

We then verify that

$$\frac{\partial^2 F_{X,Y}}{\partial x\, \partial y}(x,y) = \frac{\partial^2 (xy)}{\partial x\, \partial y}(x,y) = 1 = f_{X,Y}(x,y),$$

for all (x,y) in the unit square.


More than Two Random Variables 


The joint PDF of three random variables X, Y, and Z is defined in analogy with
the above. For example, we have

$$P\big((X,Y,Z) \in B\big) = \iiint_{(x,y,z) \in B} f_{X,Y,Z}(x,y,z)\, dx\, dy\, dz,$$

for any set B. We also have relations such as

$$f_{X,Y}(x,y) = \int f_{X,Y,Z}(x,y,z)\, dz,$$

and

$$f_X(x) = \iint f_{X,Y,Z}(x,y,z)\, dy\, dz.$$

One can also define conditional PDFs by formulas such as

$$f_{X,Y|Z}(x,y \mid z) = \frac{f_{X,Y,Z}(x,y,z)}{f_Z(z)}, \qquad \text{for } f_Z(z) > 0,$$
$$f_{X|Y,Z}(x \mid y,z) = \frac{f_{X,Y,Z}(x,y,z)}{f_{Y,Z}(y,z)}, \qquad \text{for } f_{Y,Z}(y,z) > 0.$$

There is an analog of the multiplication rule:

$$f_{X,Y,Z}(x,y,z) = f_{X|Y,Z}(x \mid y,z)\, f_{Y|Z}(y \mid z)\, f_Z(z).$$

Finally, we say that the three random variables X, Y, and Z are independent if

$$f_{X,Y,Z}(x,y,z) = f_X(x) f_Y(y) f_Z(z), \qquad \text{for all } x, y, z.$$

The expected value rule for functions takes the form

$$E[g(X,Y,Z)] = \iiint g(x,y,z) f_{X,Y,Z}(x,y,z)\, dx\, dy\, dz,$$

and if g is linear and of the form aX + bY + cZ, then

$$E[aX + bY + cZ] = aE[X] + bE[Y] + cE[Z].$$

Furthermore, there are obvious generalizations of the above to the case of more
than three random variables. For example, for any random variables $X_1, X_2, \ldots, X_n$
and any scalars $a_1, a_2, \ldots, a_n$, we have

$$E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_n E[X_n].$$


3.6 DERIVED DISTRIBUTIONS


We have seen that the mean of a function Y = g(X) of a continuous random
variable X can be calculated using the expected value rule

$$E[Y] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx,$$

without first finding the PDF $f_Y$ of Y. Still, in some cases, we may be interested
in an explicit formula for $f_Y$. Then, the following two-step approach can be
used.


Calculation of the PDF of a Function Y = g(X) of a Continuous
Random Variable X

1. Calculate the CDF $F_Y$ of Y using the formula

$$F_Y(y) = P\big(g(X) \le y\big) = \int_{\{x \,\mid\, g(x) \le y\}} f_X(x)\, dx.$$

2. Differentiate to obtain the PDF of Y:

$$f_Y(y) = \frac{dF_Y}{dy}(y).$$


Example 3.21. Let X be uniform on [0,1]. Find the PDF of $Y = \sqrt{X}$. Note
that Y takes values between 0 and 1. For every $y \in [0,1]$, we have

$$F_Y(y) = P(Y \le y) = P(\sqrt{X} \le y) = P(X \le y^2) = y^2, \qquad 0 \le y \le 1.$$

We then differentiate and obtain

$$f_Y(y) = \frac{dF_Y}{dy}(y) = \frac{d(y^2)}{dy} = 2y, \qquad 0 \le y \le 1.$$

Outside the range [0,1], the CDF $F_Y(y)$ is constant, with $F_Y(y) = 0$ for $y \le 0$, and
$F_Y(y) = 1$ for $y \ge 1$. By differentiating, we see that $f_Y(y) = 0$ for y outside [0,1].
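
The CDF obtained in the first step can be compared with simulated data. The following sketch, which is not part of the notes, draws samples of $Y = \sqrt{X}$ with X uniform on [0,1] and compares the fraction of samples not exceeding y with the derived value $y^2$; the helper `empirical_cdf`, the seed, and the sample size are illustrative choices.

```python
import math
import random


def empirical_cdf(samples, y):
    """Fraction of the samples that are less than or equal to y."""
    return sum(s <= y for s in samples) / len(samples)


rng = random.Random(2)
ys = [math.sqrt(rng.random()) for _ in range(100_000)]  # samples of Y = sqrt(X)

for y in (0.25, 0.5, 0.9):
    # Each printed pair should be close: empirical value versus derived CDF y^2.
    print(y, empirical_cdf(ys, y), y ** 2)
```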


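Example 3.22. John Slow is driving from Boston to the New York area, a
distance of 180 miles. His average speed is uniformly distributed between 30 and
60 miles per hour. What is the PDF of the duration of the trip?

Let X be the speed and let Y = g(X) be the trip duration:

$$g(X) = \frac{180}{X}.$$

To find the CDF of Y, we must calculate

$$P(Y \le y) = P\left(\frac{180}{X} \le y\right) = P\left(\frac{180}{y} \le X\right).$$

We use the given uniform PDF of X, which is

$$f_X(x) = \begin{cases} 1/30 & \text{if } 30 \le x \le 60, \\ 0 & \text{otherwise,} \end{cases}$$

and the corresponding CDF, which is

$$F_X(x) = \begin{cases} 0 & \text{if } x \le 30, \\ (x - 30)/30 & \text{if } 30 \le x \le 60, \\ 1 & \text{if } 60 \le x. \end{cases}$$

Thus,

$$\begin{aligned}
F_Y(y) &= P\left(\frac{180}{y} \le X\right) = 1 - F_X\left(\frac{180}{y}\right) \\
&= \begin{cases} 0 & \text{if } y \le 180/60, \\ 1 - \dfrac{(180/y) - 30}{30} & \text{if } 180/60 \le y \le 180/30, \\ 1 & \text{if } 180/30 \le y, \end{cases} \\
&= \begin{cases} 0 & \text{if } y \le 3, \\ 2 - (6/y) & \text{if } 3 \le y \le 6, \\ 1 & \text{if } 6 \le y, \end{cases}
\end{aligned}$$

(see Fig. 3.20). Differentiating this expression, we obtain the PDF of Y:

$$f_Y(y) = \begin{cases} 0 & \text{if } y \le 3, \\ 6/y^2 & \text{if } 3 \le y \le 6, \\ 0 & \text{if } 6 \le y. \end{cases}$$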


Example 3.23. Let $Y = g(X) = X^2$, where X is a random variable with known
PDF. For any $y \ge 0$, we have

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}),$$

and therefore, by differentiating and using the chain rule,

$$f_Y(y) = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}), \qquad y \ge 0.$$

Figure 3.20: The calculation of the PDF of Y = 180/X in Example 3.22. The
arrows indicate the flow of the calculation.


The Linear Case 


An important case arises when Y is a linear function of X. See Fig. 3.21 for a
graphical interpretation.


The PDF of a Linear Function of a Random Variable

Let X be a continuous random variable with PDF $f_X$, and let

$$Y = aX + b,$$

for some scalars $a \ne 0$ and b. Then,

$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y - b}{a}\right).$$


To verify this formula, we use the two-step procedure. We only show the
steps for the case where a > 0; the case a < 0 is similar. We have

$$\begin{aligned}
F_Y(y) &= P(Y \le y) \\
&= P(aX + b \le y) \\
&= P\left(X \le \frac{y - b}{a}\right) \\
&= F_X\left(\frac{y - b}{a}\right).
\end{aligned}$$

We now differentiate this equality and use the chain rule, to obtain

$$f_Y(y) = \frac{dF_Y}{dy}(y) = \frac{1}{a} \cdot \frac{dF_X}{dx}\left(\frac{y - b}{a}\right) = \frac{1}{a} \cdot f_X\left(\frac{y - b}{a}\right).$$


Figure 3.21: The PDF of aX + b in terms of the PDF of X. In this figure,
a = 2 and b = 5. As a first step, we obtain the PDF of aX. The range of Y is
wider than the range of X, by a factor of a. Thus, the PDF $f_X$ must be stretched
(scaled horizontally) by this factor. But in order to keep the total area under the
PDF equal to 1, we need to scale down the PDF (vertically) by the same factor a. The
random variable aX + b is the same as aX except that its values are shifted by
b. Accordingly, we take the PDF of aX and shift it (horizontally) by b. The end
result of these operations is the PDF of Y = aX + b and is given mathematically
by

$$\frac{1}{|a|} f_X\left(\frac{y - b}{a}\right).$$

If a were negative, the procedure would be the same except that the
PDF of X would first need to be reflected around the vertical axis (“flipped”),
yielding $f_{-X}$. Then a horizontal and vertical scaling (by a factor of |a| and 1/|a|,
respectively) yields the PDF of −|a|X = aX. Finally, a horizontal shift of b would
again yield the PDF of aX + b.


Example 3.24. A linear function of an exponential random variable.
Suppose that X is an exponential random variable with PDF

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{otherwise,} \end{cases}$$

where $\lambda$ is a positive parameter. Let Y = aX + b. Then,

$$f_Y(y) = \begin{cases} \dfrac{\lambda}{|a|}\, e^{-\lambda (y - b)/a} & \text{if } (y - b)/a \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$

Note that if b = 0 and a > 0, then Y is an exponential random variable with
parameter $\lambda/a$. In general, however, Y need not be exponential. For example, if
a < 0 and b = 0, then the range of Y is the negative real axis.


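Example 3.25. A linear function of a normal random variable is normal.
Suppose that X is a normal random variable with mean $\mu$ and variance $\sigma^2$, and let
Y = aX + b, where a and b are some scalars. We have

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/2\sigma^2}.$$

Therefore,

$$\begin{aligned}
f_Y(y) &= \frac{1}{|a|} f_X\left(\frac{y - b}{a}\right) \\
&= \frac{1}{|a|} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\left(((y - b)/a) - \mu\right)^2/2\sigma^2} \\
&= \frac{1}{\sqrt{2\pi}\,|a|\sigma}\, e^{-(y - b - a\mu)^2/2a^2\sigma^2}.
\end{aligned}$$

We recognize this as a normal PDF with mean $a\mu + b$ and variance $a^2\sigma^2$. In
particular, Y is a normal random variable.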


The Monotonic Case 


The calculation and the formula for the linear case can be generalized to 
the case where g is a monotonic function. Let X be a continuous random variable 
and suppose that its range is contained in a certain interval J, in the sense that 
$f_X(x) = 0$ for $x \notin I$. We consider the random variable Y = g(X), and assume
that g is strictly monotonic over the interval I. That is, either

(a) $g(x) < g(x')$ for all $x, x' \in I$ satisfying $x < x'$ (monotonically increasing
case), or

(b) $g(x) > g(x')$ for all $x, x' \in I$ satisfying $x < x'$ (monotonically decreasing
case).

Furthermore, we assume that the function g is differentiable. Its derivative
will necessarily be nonnegative in the increasing case and nonpositive in the
decreasing case.

An important fact is that a monotonic function can be “inverted” in the
sense that there is some function h, called the inverse of g, such that for all
$x \in I$, we have y = g(x) if and only if x = h(y). For example, the inverse of the
function g(x) = 180/x considered in Example 3.22 is h(y) = 180/y, because we
have y = 180/x if and only if x = 180/y. Other such examples of pairs of inverse
functions include

$$g(x) = ax + b, \qquad h(y) = \frac{y - b}{a},$$

where a and b are scalars with $a \ne 0$ (see Fig. 3.22), and

$$g(x) = e^{ax}, \qquad h(y) = \frac{\ln y}{a},$$

where a is a nonzero scalar.

Figure 3.22: A monotonically increasing function g (on the left) and its inverse
(on the right). Note that the graph of h has the same shape as the graph of g,
except that it is rotated by 90 degrees and then reflected (this is the same as
interchanging the x and y axes).


For monotonic functions g, the following is a convenient analytical formula
for the PDF of the function Y = g(X).


PDF Formula for a Monotonic Function of a Continuous Random
Variable

Suppose that g is monotonic and that for some function h and all x in the
range I of X we have

y = g(x) if and only if x = h(y).

Assume that h has first derivative (dh/dy)(y). Then the PDF of Y in the
region where $f_Y(y) > 0$ is given by

$$f_Y(y) = f_X\big(h(y)\big) \left| \frac{dh}{dy}(y) \right|.$$


For a verification of the above formula, assume first that g is monotonically
increasing. Then, we have

$$F_Y(y) = P\big(g(X) \le y\big) = P\big(X \le h(y)\big) = F_X\big(h(y)\big),$$

where the second equality can be justified using the monotonically increasing
property of g (see Fig. 3.23). By differentiating this relation, using also the
chain rule, we obtain

$$f_Y(y) = \frac{dF_Y}{dy}(y) = f_X\big(h(y)\big)\, \frac{dh}{dy}(y).$$

Because g is monotonically increasing, h is also monotonically increasing, so its
derivative is positive:

$$\frac{dh}{dy}(y) = \left| \frac{dh}{dy}(y) \right|.$$

This justifies the PDF formula for a monotonically increasing function g. The
justification for the case of a monotonically decreasing function is similar: we
differentiate instead the relation

$$F_Y(y) = P\big(g(X) \le y\big) = P\big(X \ge h(y)\big) = 1 - F_X\big(h(y)\big),$$

and use the chain rule.
There is a similar formula involving the derivative of g, rather than the
derivative of h. To see this, differentiate the equality g(h(y)) = y, and use the
chain rule to obtain

$$\frac{dg}{dx}\big(h(y)\big) \cdot \frac{dh}{dy}(y) = 1.$$

Let us fix some x and y that are related by g(x) = y, which is the same as
h(y) = x. Then,

$$\frac{dg}{dx}(x) \cdot \frac{dh}{dy}(y) = 1,$$

which leads to

$$f_Y(y) = f_X(x) \Big/ \left| \frac{dg}{dx}(x) \right|.$$

Figure 3.23: Calculating the probability $P\big(g(X) \le y\big)$. When g is monotonically
increasing (left figure), the event $\{g(X) \le y\}$ is the same as the event $\{X \le h(y)\}$.
When g is monotonically decreasing (right figure), the event $\{g(X) \le y\}$ is the
same as the event $\{X \ge h(y)\}$.


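Example 3.22. (Continued) To check the PDF formula, let us apply it to
the problem of Example 3.22. In the region of interest, $x \in [30, 60]$, we have
h(y) = 180/y, and

$$f_X\big(h(y)\big) = \frac{dF_X}{dx}\big(h(y)\big) = \frac{1}{30}, \qquad \left| \frac{dh}{dy}(y) \right| = \frac{180}{y^2}.$$

Thus, in the region of interest $y \in [3, 6]$, the PDF formula yields

$$f_Y(y) = f_X\big(h(y)\big) \left| \frac{dh}{dy}(y) \right| = \frac{1}{30} \cdot \frac{180}{y^2} = \frac{6}{y^2},$$

consistently with the expression obtained earlier.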
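
The PDF formula for monotonic functions translates almost literally into code. The sketch below is an illustration rather than part of the notes: the helper `pdf_of_monotonic` is a hypothetical name, and the caller is assumed to supply the PDF of X, the inverse function h, and its derivative. It is applied here to the trip duration of Example 3.22.

```python
def pdf_of_monotonic(f_x, h, dh_dy):
    """Return f_Y with f_Y(y) = f_X(h(y)) * |dh/dy (y)| for a strictly monotonic g."""
    return lambda y: f_x(h(y)) * abs(dh_dy(y))


def f_x(x):
    # Speed X is uniform on [30, 60].
    return 1 / 30 if 30 <= x <= 60 else 0.0


def h(y):
    # Inverse of g(x) = 180 / x (defined for y != 0).
    return 180 / y


def dh_dy(y):
    # Derivative of the inverse function.
    return -180 / y ** 2


f_y = pdf_of_monotonic(f_x, h, dh_dy)
print(f_y(4.0), 6 / 4.0 ** 2)   # both values equal 6 / y^2 at y = 4
```

Because `f_x` vanishes outside [30,60], the returned function is automatically zero for durations outside [3,6], so no separate case analysis is needed.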


Example 3.26. Let $Y = g(X) = X^2$, where X is a continuous uniform random
variable in the interval (0,1]. Within this interval, g is monotonic, and its inverse
is $h(y) = \sqrt{y}$. Thus, for any $y \in (0,1]$, we have

$$\left| \frac{dh}{dy}(y) \right| = \frac{1}{2\sqrt{y}}, \qquad f_X(\sqrt{y}) = 1,$$

and

$$f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}} & \text{if } y \in (0,1], \\ 0 & \text{otherwise.} \end{cases}$$

We finally note that if we interpret PDFs in terms of probabilities of small 
intervals, the content of our formulas becomes pretty intuitive; see Fig. 3.24. 


Functions of Two Random Variables 


The two-step procedure that first calculates the CDF and then differentiates to 
obtain the PDF also applies to functions of more than one random variable. 


Example 3.27. Two archers shoot at a target. The distance of each shot from 
the center of the target is uniformly distributed from 0 to 1, independently of the 
other shot. What is the PDF of the distance of the losing shot from the center? 

Let X and Y be the distances from the center of the first and second shots, 
respectively. Let also Z be the distance of the losing shot: 


Z = max{X,Y}. 


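We know that X and Y are uniformly distributed over [0,1], so that for all $z \in [0,1]$,
we have

$$P(X \le z) = P(Y \le z) = z.$$

Thus, using the independence of X and Y, we have for all $z \in [0,1]$,

$$\begin{aligned}
F_Z(z) &= P\big(\max\{X, Y\} \le z\big) \\
&= P(X \le z,\ Y \le z) \\
&= P(X \le z) P(Y \le z) \\
&= z^2.
\end{aligned}$$

Differentiating, we obtain

$$f_Z(z) = \begin{cases} 2z & \text{if } 0 \le z \le 1, \\ 0 & \text{otherwise.} \end{cases}$$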


Figure 3.24: Illustration of the PDF formula for a monotonically increasing
function g. Consider an interval $[x, x + \delta_1]$, where $\delta_1$ is a small number. Under
the mapping g, the image of this interval is another interval $[y, y + \delta_2]$. Since
(dg/dx)(x) is the slope of g, we have

$$\frac{\delta_2}{\delta_1} \approx \frac{dg}{dx}(x),$$

or in terms of the inverse function,

$$\frac{\delta_1}{\delta_2} \approx \frac{dh}{dy}(y).$$

We now note that the event $\{x \le X \le x + \delta_1\}$ is the same as the event
$\{y \le Y \le y + \delta_2\}$. Thus,

$$\begin{aligned}
f_Y(y)\,\delta_2 &\approx P(y \le Y \le y + \delta_2) \\
&= P(x \le X \le x + \delta_1) \\
&\approx f_X(x)\,\delta_1.
\end{aligned}$$

We move $\delta_1$ to the left-hand side and use our earlier formula for the ratio $\delta_2/\delta_1$,
to obtain

$$f_Y(y)\, \frac{dg}{dx}(x) = f_X(x).$$

Alternatively, if we move $\delta_2$ to the right-hand side and use the formula for $\delta_1/\delta_2$,
we obtain

$$f_Y(y) = f_X\big(h(y)\big) \cdot \frac{dh}{dy}(y).$$


Example 3.28. Let X and Y be independent random variables that are uniformly
distributed on the interval [0,1]. What is the PDF of the random variable Z =
Y/X?

We will find the PDF of Z by first finding its CDF and then differentiating.
We consider separately the cases $0 \le z \le 1$ and $z > 1$. As shown in Fig. 3.25, we
have

$$F_Z(z) = P\left(\frac{Y}{X} \le z\right) = \begin{cases} z/2 & \text{if } 0 \le z \le 1, \\ 1 - 1/(2z) & \text{if } z > 1, \\ 0 & \text{otherwise.} \end{cases}$$

By differentiating, we obtain

$$f_Z(z) = \begin{cases} 1/2 & \text{if } 0 \le z \le 1, \\ 1/(2z^2) & \text{if } z > 1, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 3.25: The calculation of the CDF of Z = Y/X in Example 3.28. The
value $P(Y/X \le z)$ is equal to the shaded subarea of the unit square. The figure
on the left deals with the case where $0 \le z \le 1$ and the figure on the right refers
to the case where $z > 1$.
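
The two-case CDF of Example 3.28 can be checked against a simulation. The sketch below is an illustration, not part of the notes; the names `cdf_ratio` and `fraction_at_most`, the seed, and the sample size are arbitrary choices. It samples independent uniform X and Y, forms the ratio Y/X, and compares the observed fraction of ratios not exceeding z with the derived formula.

```python
import random


def cdf_ratio(z):
    """CDF of Z = Y / X derived in Example 3.28 for independent uniforms on [0, 1]."""
    if z < 0:
        return 0.0
    if z <= 1:
        return z / 2
    return 1 - 1 / (2 * z)


rng = random.Random(3)
ratios = []
for _ in range(100_000):
    x, y = rng.random(), rng.random()
    if x > 0.0:                 # random() may return 0.0; skip it to avoid dividing by zero
        ratios.append(y / x)


def fraction_at_most(samples, z):
    return sum(s <= z for s in samples) / len(samples)


for z in (0.5, 1.0, 3.0):
    # Each printed pair should be close: simulated fraction versus derived CDF.
    print(z, fraction_at_most(ratios, z), cdf_ratio(z))
```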


Example 3.29. Romeo and Juliet have a date at a given time, and each,
independently, will be late by an amount of time that is exponentially distributed with
parameter $\lambda$. What is the PDF of the difference between their times of arrival?

Let us denote by X and Y the amounts by which Romeo and Juliet are late,
respectively. We want to find the PDF of Z = X − Y, assuming that X and Y are
independent and exponentially distributed with parameter $\lambda$. We will first calculate
the CDF $F_Z(z)$ by considering separately the cases $z \ge 0$ and $z < 0$ (see Fig. 3.26).

For $z \ge 0$, we have (see the left side of Fig. 3.26)

$$\begin{aligned}
F_Z(z) &= P(X - Y \le z) \\
&= 1 - P(X - Y > z) \\
&= 1 - \int_0^{\infty} \left( \int_{z+y}^{\infty} f_{X,Y}(x,y)\, dx \right) dy \\
&= 1 - \int_0^{\infty} \lambda e^{-\lambda y} \left( \int_{z+y}^{\infty} \lambda e^{-\lambda x}\, dx \right) dy \\
&= 1 - \int_0^{\infty} \lambda e^{-\lambda y} e^{-\lambda (z+y)}\, dy \\
&= 1 - e^{-\lambda z} \int_0^{\infty} \lambda e^{-2\lambda y}\, dy \\
&= 1 - \frac{1}{2} e^{-\lambda z}.
\end{aligned}$$

Figure 3.26: The calculation of the CDF of Z = X − Y in Example 3.29. To
obtain the value P(X − Y > z) we must integrate the joint PDF $f_{X,Y}(x,y)$
over the shaded area in the above figures, which correspond to $z \ge 0$ (left
side) and $z < 0$ (right side).

For the case $z < 0$, we can use a similar calculation, but we can also argue
using symmetry. Indeed, the symmetry of the situation implies that the random
variables Z = X − Y and −Z = Y − X have the same distribution. We have

$$F_Z(z) = P(Z \le z) = P(-Z \ge -z) = P(Z \ge -z) = 1 - F_Z(-z).$$

With $z < 0$, we have $-z \ge 0$ and using the formula derived earlier,

$$F_Z(z) = 1 - F_Z(-z) = 1 - \left( 1 - \frac{1}{2} e^{-\lambda(-z)} \right) = \frac{1}{2} e^{\lambda z}.$$

Combining the two cases $z \ge 0$ and $z < 0$, we obtain

$$F_Z(z) = \begin{cases} 1 - \dfrac{1}{2} e^{-\lambda z} & \text{if } z \ge 0, \\[1ex] \dfrac{1}{2} e^{\lambda z} & \text{if } z < 0. \end{cases}$$

We now calculate the PDF of Z by differentiating its CDF. We obtain

$$f_Z(z) = \begin{cases} \dfrac{\lambda}{2} e^{-\lambda z} & \text{if } z \ge 0, \\[1ex] \dfrac{\lambda}{2} e^{\lambda z} & \text{if } z < 0, \end{cases}$$

or

$$f_Z(z) = \frac{\lambda}{2} e^{-\lambda |z|}.$$

This is known as a two-sided exponential PDF, also known as the Laplace
PDF.
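
The derived CDF of the difference can be compared with simulated arrival delays. This is a minimal sketch under the assumptions of Example 3.29 (independent exponential delays with a common parameter); the value `lam = 2.0`, the seed, and the helper names are hypothetical choices made only for illustration.

```python
import math
import random


def laplace_cdf(z, lam):
    """CDF of Z = X - Y for independent exponential X and Y with parameter lam."""
    if z >= 0:
        return 1.0 - 0.5 * math.exp(-lam * z)
    return 0.5 * math.exp(lam * z)


def empirical_cdf(samples, z):
    """Fraction of the samples that are less than or equal to z."""
    return sum(s <= z for s in samples) / len(samples)


lam = 2.0                       # hypothetical rate, chosen only for illustration
rng = random.Random(4)
diffs = [rng.expovariate(lam) - rng.expovariate(lam) for _ in range(100_000)]

for z in (-0.5, 0.0, 0.7):
    # Each printed pair should be close: simulated fraction versus derived CDF.
    print(z, empirical_cdf(diffs, z), laplace_cdf(z, lam))
```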


3.7 SUMMARY AND DISCUSSION


Continuous random variables are characterized by PDFs and arise in many
applications. PDFs are used to calculate event probabilities. This is similar to
the use of PMFs for the discrete case, except that now we need to integrate 
instead of adding. Joint PDFs are similar to joint PMFs and are used to de- 
termine the probability of events that are defined in terms of multiple random 
variables. Finally, conditional PDFs are similar to conditional PMFs and are 
used to calculate conditional probabilities, given the value of the conditioning 
random variable. 

We have also introduced a few important continuous probability laws and 
derived their mean and variance. A summary is provided in the table that 
follows. 


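Summary of Results for Special Random Variables

Continuous Uniform Over [a,b]:

$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{a + b}{2}, \qquad \mathrm{var}(X) = \frac{(b - a)^2}{12}.$$

Exponential with Parameter $\lambda$:

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{otherwise,} \end{cases} \qquad
F_X(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$

Normal with Parameters $\mu$ and $\sigma^2$:

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2/2\sigma^2},$$

$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2.$$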


Further Topics on Random Variables and Expectations


Contents

4.1. Transforms
4.2. Sums of Independent Random Variables - Convolutions
4.3. Conditional Expectation as a Random Variable
4.4. Sum of a Random Number of Independent Random Variables
4.5. Covariance and Correlation
4.6. Least Squares Estimation
4.7. The Bivariate Normal Distribution

In this chapter, we develop a number of more advanced topics. We introduce
methods that are useful in: 


(a) dealing with the sum of independent random variables, including the case 
where the number of random variables is itself random; 


(b) addressing problems of estimation or prediction of an unknown random 
variable on the basis of observed values of other random variables. 


With these goals in mind, we introduce a number of tools, including transforms 
and convolutions, and refine our understanding of the concept of conditional 
expectation. 


4.1 TRANSFORMS


In this section, we introduce the transform associated with a random variable. 
The transform provides us with an alternative representation of its probability 
law (PMF or PDF). It is not particularly intuitive, but it is often convenient for 
certain types of mathematical manipulations. 

The transform of the distribution of a random variable X (also referred
to as the moment generating function of X) is a function $M_X(s)$ of a free
parameter s, defined by

\[ M_X(s) = E[e^{sX}]. \]

The simpler notation M(s) can also be used whenever the underlying random 
variable X is clear from the context. In more detail, when X is a discrete random 
variable, the corresponding transform is given by 


\[ M(s) = \sum_x e^{sx} p_X(x), \]

while in the continuous case, we have†

\[ M(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx. \]


Example 4.1. Let

\[ p_X(x) = \begin{cases} 1/2, & \text{if } x = 2, \\ 1/6, & \text{if } x = 3, \\ 1/3, & \text{if } x = 5. \end{cases} \]


† The reader who is familiar with Laplace transforms may recognize that the
transform associated with a continuous random variable is essentially the same as the Laplace
transform of its PDF, the only difference being that Laplace transforms usually involve
$e^{-sx}$ rather than $e^{sx}$. For the discrete case, a variable z is sometimes used in place
of $e^s$ and the resulting transform $M(z) = \sum_x z^x p_X(x)$ is known as the z-transform.
However, we will not be using z-transforms in this book.
Then, the corresponding transform is

\[ M(s) = \frac{1}{2} e^{2s} + \frac{1}{6} e^{3s} + \frac{1}{3} e^{5s} \]


(see Fig. 4.1). 


Figure 4.1: The PMF and the corresponding transform for Example 4.1. The
transform M(s) consists of the weighted sum of the three exponentials shown.
Note that at s = 0, the transform takes the value 1. This is generically true since

\[ M(0) = \sum_x e^{0 \cdot x} p_X(x) = \sum_x p_X(x) = 1. \]
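As a quick numerical illustration of the definition, the following sketch evaluates the transform of the PMF of Example 4.1 and checks that M(0) = 1. The function name `transform` and the dictionary representation of the PMF are choices made here for illustration; they are not part of the notes.

```python
import math

def transform(pmf, s):
    """Evaluate M(s) = sum_x e^{s x} p_X(x) for a PMF given as {value: probability}."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

# PMF of Example 4.1: P(X=2) = 1/2, P(X=3) = 1/6, P(X=5) = 1/3.
pmf_41 = {2: 1 / 2, 3: 1 / 6, 5: 1 / 3}

m_at_zero = transform(pmf_41, 0.0)  # equals 1 up to rounding, since the PMF sums to 1
m_at_half = transform(pmf_41, 0.5)  # (1/2)e^{1} + (1/6)e^{1.5} + (1/3)e^{2.5}
print(m_at_zero, m_at_half)
```

The same function works for any finite PMF, because the transform of a finite-valued discrete random variable is just a finite weighted sum of exponentials.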


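Example 4.2. The Transform of a Poisson Random Variable. Consider a
Poisson random variable X with parameter $\lambda$:

\[ p_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots \]

The corresponding transform is given by

\[ M(s) = \sum_{x=0}^{\infty} e^{sx} \frac{\lambda^x e^{-\lambda}}{x!}. \]

We let $a = e^s \lambda$ and obtain

\[ M(s) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{a^x}{x!} = e^{-\lambda} e^{a} = e^{a - \lambda} = e^{\lambda(e^s - 1)}. \]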


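Example 4.3. The Transform of an Exponential Random Variable. Let
X be an exponential random variable with parameter $\lambda$:

\[ f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0. \]

Then,

\[ \begin{aligned} M(s) &= \lambda \int_0^{\infty} e^{sx} e^{-\lambda x}\,dx \\ &= \lambda \int_0^{\infty} e^{(s-\lambda)x}\,dx \\ &= \lambda \left. \frac{e^{(s-\lambda)x}}{s - \lambda} \right|_0^{\infty} \qquad (\text{if } s < \lambda) \\ &= \frac{\lambda}{\lambda - s}. \end{aligned} \]

The above calculation and the formula for M(s) are correct only if the integrand
$e^{(s-\lambda)x}$ decays as x increases, which is the case if and only if $s < \lambda$; otherwise, the
integral is infinite.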


It is important to realize that the transform is not a number but rather a 
function of a free variable or parameter s. Thus, we are dealing with a transfor- 
mation that starts with a function, e.g., a PDF fx(x) (which is a function of a 
free variable x) and results in a new function, this time of a real parameter s. 
Strictly speaking, M(s) is only defined for those values of s for which $E[e^{sX}]$ is
finite, as noted in the preceding example.


Example 4.4. The Transform of a Linear Function of a Random Variable.
Let $M_X(s)$ be the transform associated with a random variable X. Consider a new
random variable Y = aX + b. We then have

\[ M_Y(s) = E[e^{s(aX+b)}] = e^{sb} E[e^{saX}] = e^{sb} M_X(sa). \]

For example, if X is exponential with parameter $\lambda = 1$, so that $M_X(s) = 1/(1-s)$,
and if Y = 2X + 3, then

\[ M_Y(s) = e^{3s} \frac{1}{1 - 2s}. \]

Example 4.5. The Transform of a Normal Random Variable. Let X
be a normal random variable with mean $\mu$ and variance $\sigma^2$. To calculate the
corresponding transform, we first consider the special case of the standard normal
random variable Y, where $\mu = 0$ and $\sigma^2 = 1$, and then use the formula of the
preceding example. The PDF of the standard normal is

\[ f_Y(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}, \]

and its transform is

\[ \begin{aligned} M_Y(s) &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} e^{sy}\,dy \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y^2/2) + sy}\,dy \\ &= e^{s^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y^2/2) + sy - (s^2/2)}\,dy \\ &= e^{s^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y-s)^2/2}\,dy \\ &= e^{s^2/2}, \end{aligned} \]

where the last equality follows by using the normalization property of a normal
PDF with mean s and unit variance.

A general normal random variable with mean $\mu$ and variance $\sigma^2$ is obtained
from the standard normal via the linear transformation

\[ X = \sigma Y + \mu. \]

The transform of the standard normal is $M_Y(s) = e^{s^2/2}$, as verified above. By
applying the formula of Example 4.4, we obtain

\[ M_X(s) = e^{s\mu} M_Y(s\sigma) = e^{\frac{\sigma^2 s^2}{2} + \mu s}. \]


From Transforms to Moments 


The reason behind the alternative name “moment generating function” is that 
the moments of a random variable are easily computed once a formula for the 
associated transform is available. To see this, let us take the derivative of both 
sides of the definition 
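\[ M(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx, \]

with respect to s. We obtain

\[ \begin{aligned} \frac{d}{ds} M(s) &= \frac{d}{ds} \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx \\ &= \int_{-\infty}^{\infty} \frac{d}{ds} e^{sx} f_X(x)\,dx \\ &= \int_{-\infty}^{\infty} x e^{sx} f_X(x)\,dx. \end{aligned} \]

This equality holds for all values of s. By considering the special case where
s = 0, we obtain†

\[ \left. \frac{d}{ds} M(s) \right|_{s=0} = \int_{-\infty}^{\infty} x f_X(x)\,dx = E[X]. \]

More generally, if we differentiate n times the function M(s) with respect to s,
a similar calculation yields

\[ \left. \frac{d^n}{ds^n} M(s) \right|_{s=0} = \int_{-\infty}^{\infty} x^n f_X(x)\,dx = E[X^n]. \]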

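Example 4.6. We saw earlier (Example 4.1) that the PMF

\[ p_X(x) = \begin{cases} 1/2, & \text{if } x = 2, \\ 1/6, & \text{if } x = 3, \\ 1/3, & \text{if } x = 5, \end{cases} \]

has the transform

\[ M(s) = \frac{1}{2} e^{2s} + \frac{1}{6} e^{3s} + \frac{1}{3} e^{5s}. \]

Thus,

\[ \begin{aligned} E[X] &= \left. \frac{d}{ds} M(s) \right|_{s=0} \\ &= \left. \frac{1}{2} \cdot 2 e^{2s} + \frac{1}{6} \cdot 3 e^{3s} + \frac{1}{3} \cdot 5 e^{5s} \right|_{s=0} \\ &= \frac{1}{2} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{3} \cdot 5 \\ &= \frac{19}{6}. \end{aligned} \]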

† This derivation involves an interchange of differentiation and integration. The
interchange turns out to be justified for all of the applications to be considered in
this book. Furthermore, the derivation remains valid for general random variables,
including discrete ones. In fact, it could be carried out more abstractly, in the form

\[ \frac{d}{ds} M(s) = \frac{d}{ds} E[e^{sX}] = E\left[ \frac{d}{ds} e^{sX} \right] = E[X e^{sX}], \]

leading to the same conclusion.
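Also,

\[ \begin{aligned} E[X^2] &= \left. \frac{d^2}{ds^2} M(s) \right|_{s=0} \\ &= \left. \frac{1}{2} \cdot 4 e^{2s} + \frac{1}{6} \cdot 9 e^{3s} + \frac{1}{3} \cdot 25 e^{5s} \right|_{s=0} \\ &= \frac{1}{2} \cdot 4 + \frac{1}{6} \cdot 9 + \frac{1}{3} \cdot 25 \\ &= \frac{71}{6}. \end{aligned} \]

For an exponential random variable with PDF

\[ f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0, \]

we found earlier that

\[ M(s) = \frac{\lambda}{\lambda - s}. \]

Thus,

\[ \frac{d}{ds} M(s) = \frac{\lambda}{(\lambda - s)^2}, \qquad \frac{d^2}{ds^2} M(s) = \frac{2\lambda}{(\lambda - s)^3}. \]

By setting s = 0, we obtain

\[ E[X] = \frac{1}{\lambda}, \qquad E[X^2] = \frac{2}{\lambda^2}, \]

which agrees with the formulas derived in Chapter 3.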
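The moment calculations above can be checked numerically. The sketch below approximates the first two derivatives of a transform at s = 0 by central differences and compares them with the values 19/6 and 71/6 of Example 4.6, and with $1/\lambda$ and $2/\lambda^2$ for the exponential. The helper names, the step size `h`, and the rate `lam = 2.0` are assumptions made here for illustration only.

```python
import math

def transform(pmf, s):
    """M(s) = sum_x e^{s x} p_X(x) for a PMF given as {value: probability}."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

def first_two_moments(M, h=1e-4):
    """Approximate M'(0) and M''(0) by central finite differences."""
    first = (M(h) - M(-h)) / (2 * h)
    second = (M(h) - 2 * M(0.0) + M(-h)) / (h * h)
    return first, second

# Discrete PMF of Example 4.6.
pmf_46 = {2: 1 / 2, 3: 1 / 6, 5: 1 / 3}
m1, m2 = first_two_moments(lambda s: transform(pmf_46, s))

# Exponential transform lam / (lam - s), with a hypothetical rate lam = 2.
lam = 2.0
e1, e2 = first_two_moments(lambda s: lam / (lam - s))

print(m1, m2)  # close to 19/6 and 71/6
print(e1, e2)  # close to 1/lam and 2/lam**2
```

Finite differences are used only as an independent check; in practice the derivatives are obtained symbolically, as in the text.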


Inversion of Transforms 


A very important property of transforms is the following. 


Inversion Property 


The transform Mx (s) completely determines the probability law of the ran- 
dom variable X. In particular, if Mx(s) = My(s) for all s, then the random 
variables X and Y have the same probability law. 


This property is a rather deep mathematical fact that we will use
frequently.† There exist explicit formulas that allow us to recover the PMF or
PDF of a random variable starting from the associated transform, but they are 
quite difficult to use. In practice, transforms are usually inverted by “pattern 
matching,” based on tables of known distribution-transform pairs. We will see 
a number of such examples shortly. 


† In fact, the probability law of a random variable is completely determined even
if we only know the transform M(s) for values of s in some interval of positive length.
Example 4.7. We are told that the transform associated with a random variable
X is

\[ M(s) = \frac{1}{4} e^{-s} + \frac{1}{2} + \frac{1}{8} e^{4s} + \frac{1}{8} e^{5s}. \]

Since M(s) is a sum of terms of the form $e^{sx}$, we can compare with the general
formula

\[ M(s) = \sum_x e^{sx} p_X(x), \]

and infer that X is a discrete random variable. The different values that X can
take can be read from the corresponding exponents and are -1, 0, 4, and 5. The
probability of each value x is given by the coefficient multiplying the corresponding
$e^{sx}$ term. In our case, P(X = -1) = 1/4, P(X = 0) = 1/2, P(X = 4) = 1/8,
P(X = 5) = 1/8.


Generalizing from the last example, the distribution of a finite-valued dis- 
crete random variable can be always found by inspection of the corresponding 
transform. The same procedure also works for discrete random variables with 
an infinite range, as in the example that follows. 


Example 4.8. The Transform of a Geometric Random Variable. We are
told that the transform associated with random variable X is of the form

\[ M(s) = \frac{p e^s}{1 - (1-p) e^s}, \]

where p is a constant in the range $0 < p < 1$. We wish to find the distribution of
X. We recall the formula for the geometric series:

\[ \frac{1}{1 - \alpha} = 1 + \alpha + \alpha^2 + \cdots, \]

which is valid whenever $|\alpha| < 1$. We use this formula with $\alpha = (1-p)e^s$, and for s
sufficiently close to zero so that $(1-p)e^s < 1$. We obtain

\[ M(s) = p e^s \left( 1 + (1-p) e^s + (1-p)^2 e^{2s} + (1-p)^3 e^{3s} + \cdots \right). \]

As in the previous example, we infer that this is a discrete random variable that
takes positive integer values. The probability P(X = k) is found by reading the
coefficient of the term $e^{ks}$. In particular, P(X = 1) = p, P(X = 2) = p(1-p), etc.,
and

\[ P(X = k) = p(1-p)^{k-1}, \qquad k = 1, 2, \ldots \]

We recognize this as the geometric distribution with parameter p.
Note that

\[ \frac{d}{ds} M(s) = \frac{p e^s}{1 - (1-p) e^s} + \frac{(1-p) p e^{2s}}{\left( 1 - (1-p) e^s \right)^2}. \]

If we set s = 0, the above expression evaluates to 1/p, which agrees with the formula
for E[X] derived in Chapter 2.
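The series expansion used in Example 4.8 can be checked numerically. The sketch below compares the closed-form geometric transform with a truncated version of the sum $\sum_k e^{sk} p(1-p)^{k-1}$, and estimates E[X] = 1/p from a central difference at s = 0. The values p = 0.3 and s = 0.1, the truncation at 500 terms, and the function names are illustrative assumptions.

```python
import math

def geometric_transform(p, s):
    """Closed form M(s) = p e^s / (1 - (1-p) e^s), valid when (1-p) e^s < 1."""
    return p * math.exp(s) / (1 - (1 - p) * math.exp(s))

def geometric_series_transform(p, s, terms=500):
    """Truncated sum over k = 1..terms of e^{sk} p (1-p)^{k-1}."""
    ratio = (1 - p) * math.exp(s)  # the quantity called alpha in the example
    return sum(p * math.exp(s) * ratio ** (k - 1) for k in range(1, terms + 1))

p, s = 0.3, 0.1
closed = geometric_transform(p, s)
series = geometric_series_transform(p, s)

# Central-difference estimate of M'(0), which should be close to E[X] = 1/p.
h = 1e-5
mean_estimate = (geometric_transform(p, h) - geometric_transform(p, -h)) / (2 * h)
print(closed, series, mean_estimate)
```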


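Example 4.9. The Transform of a Mixture of Two Distributions. The
neighborhood bank has three tellers, two of them fast, one slow. The time to assist
a customer is exponentially distributed with parameter $\lambda = 6$ at the fast tellers,
and $\lambda = 4$ at the slow teller. Jane enters the bank and chooses a teller at random,
each one with probability 1/3. Find the PDF of the time it takes to assist Jane and
its transform.

We have

\[ f_X(x) = \frac{2}{3} \cdot 6 e^{-6x} + \frac{1}{3} \cdot 4 e^{-4x}, \qquad x \ge 0. \]

Then,

\[ \begin{aligned} M(s) &= \int_0^{\infty} e^{sx} \left( \frac{2}{3} 6 e^{-6x} + \frac{1}{3} 4 e^{-4x} \right) dx \\ &= \frac{2}{3} \int_0^{\infty} e^{sx} 6 e^{-6x}\,dx + \frac{1}{3} \int_0^{\infty} e^{sx} 4 e^{-4x}\,dx \\ &= \frac{2}{3} \cdot \frac{6}{6 - s} + \frac{1}{3} \cdot \frac{4}{4 - s} \qquad (\text{for } s < 4). \end{aligned} \]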


More generally, let $X_1, \ldots, X_n$ be continuous random variables with PDFs
$f_{X_1}, \ldots, f_{X_n}$, and let Y be a random variable, which is equal to $X_i$ with probability
$p_i$. Then,

\[ f_Y(y) = p_1 f_{X_1}(y) + \cdots + p_n f_{X_n}(y), \]

and

\[ M_Y(s) = p_1 M_{X_1}(s) + \cdots + p_n M_{X_n}(s). \]


The steps in this problem can be reversed. For example, we may be told that
the transform associated with a random variable Y is of the form

\[ \frac{1}{2} \cdot \frac{1}{2 - s} + \frac{3}{4} \cdot \frac{1}{1 - s}. \]

We can then rewrite it as

\[ \frac{1}{4} \cdot \frac{2}{2 - s} + \frac{3}{4} \cdot \frac{1}{1 - s}, \]

and recognize that Y is the mixture of two exponential random variables with
parameters 2 and 1, which are selected with probabilities 1/4 and 3/4, respectively.
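The mixture formula of Example 4.9 can be sanity-checked by numerical integration. The sketch below assumes the mixture PDF implied by the problem statement (rate 6 with probability 2/3, rate 4 with probability 1/3) and compares a midpoint-rule approximation of $\int_0^\infty e^{sx} f_X(x)\,dx$ with the closed form. The integration limit, the number of steps, and the function names are choices made here, not part of the notes.

```python
import math

def mixture_pdf(x):
    """PDF of the service time: rate 6 with probability 2/3, rate 4 with probability 1/3."""
    return (2 / 3) * 6 * math.exp(-6 * x) + (1 / 3) * 4 * math.exp(-4 * x)

def mixture_transform(s):
    """Closed form from the example, valid for s < 4."""
    return (2 / 3) * 6 / (6 - s) + (1 / 3) * 4 / (4 - s)

def numeric_transform(pdf, s, upper=20.0, steps=200_000):
    """Midpoint-rule approximation of the integral of e^{sx} pdf(x) over [0, upper]."""
    dx = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += math.exp(s * x) * pdf(x)
    return total * dx

s = 1.0
closed = mixture_transform(s)
approx = numeric_transform(mixture_pdf, s)
print(closed, approx)
```

Truncating the integral at 20 is harmless here because the integrand decays at least as fast as $e^{-3x}$ when s = 1.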
Sums of Independent Random Variables


Transform methods are particularly convenient when dealing with a sum of random
variables. This is because it turns out that addition of independent random
variables corresponds to multiplication of transforms, as we now show.

Let X and Y be independent random variables, and let W = X + Y. The
transform associated with W is, by definition,

\[ M_W(s) = E[e^{sW}] = E[e^{s(X+Y)}] = E[e^{sX} e^{sY}]. \]

Consider a fixed value of the parameter s. Since X and Y are independent,
$e^{sX}$ and $e^{sY}$ are independent random variables. Hence, the expectation of their
product is the product of the expectations, and

\[ M_W(s) = E[e^{sX}] E[e^{sY}] = M_X(s) M_Y(s). \]

By the same argument, if $X_1, \ldots, X_n$ is a collection of independent random
variables, and

\[ W = X_1 + \cdots + X_n, \]

then

\[ M_W(s) = M_{X_1}(s) \cdots M_{X_n}(s). \]


Example 4.10. The Transform of the Binomial. Let $X_1, \ldots, X_n$ be
independent Bernoulli random variables with a common parameter p. Then,

\[ M_{X_i}(s) = (1-p) e^{0 s} + p e^{1 s} = 1 - p + p e^s, \qquad \text{for all } i. \]

The random variable $Y = X_1 + \cdots + X_n$ is binomial with parameters n and p. Its
transform is given by

\[ M_Y(s) = \left( 1 - p + p e^s \right)^n. \]
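One way to see Example 4.10 at work is to treat each Bernoulli transform as a polynomial in $e^s$ and multiply n copies together; the coefficient of $e^{ks}$ is then P(Y = k). The sketch below does this with plain coefficient lists and compares the result with the binomial PMF. The representation, the function names, and the values n = 5, p = 0.3 are illustrative assumptions.

```python
from math import comb

def multiply_transforms(a, b):
    """Multiply two transforms written as polynomials in e^s (lists of coefficients)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def binomial_pmf_via_transforms(n, p):
    """Coefficients of (1 - p + p e^s)^n; entry k is the coefficient of e^{ks}."""
    coeffs = [1.0]  # transform of the constant 0
    for _ in range(n):
        coeffs = multiply_transforms(coeffs, [1 - p, p])
    return coeffs

n, p = 5, 0.3
pmf = binomial_pmf_via_transforms(n, p)
expected = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
print(pmf)
```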


Example 4.11. The Sum of Independent Poisson Random Variables is
Poisson. Let X and Y be independent Poisson random variables with means $\lambda$
and $\mu$, respectively, and let W = X + Y. Then,

\[ M_X(s) = e^{\lambda(e^s - 1)}, \qquad M_Y(s) = e^{\mu(e^s - 1)}, \]

and

\[ M_W(s) = M_X(s) M_Y(s) = e^{\lambda(e^s - 1)} e^{\mu(e^s - 1)} = e^{(\lambda + \mu)(e^s - 1)}. \]

Thus, W has the same transform as a Poisson random variable with mean $\lambda + \mu$.
By the uniqueness property of transforms, W is Poisson with mean $\lambda + \mu$.
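The conclusion of Example 4.11 can also be checked directly on the PMFs, without transforms. The sketch below sums $p_X(x) p_Y(w - x)$ over x for two Poisson PMFs and compares the result with the Poisson PMF of mean $\lambda + \mu$. The means 1.5 and 2.5, the range of w, and the function names are assumptions made for illustration.

```python
import math

def poisson_pmf(mean, k):
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def sum_pmf_at(pmf_x, pmf_y, w):
    """P(X + Y = w) for independent nonnegative integer X and Y."""
    return sum(pmf_x(x) * pmf_y(w - x) for x in range(w + 1))

lam, mu = 1.5, 2.5
max_gap = max(
    abs(
        sum_pmf_at(lambda k: poisson_pmf(lam, k), lambda k: poisson_pmf(mu, k), w)
        - poisson_pmf(lam + mu, w)
    )
    for w in range(30)
)
print(max_gap)  # tiny: the two PMFs agree up to rounding
```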
Example 4.12. The Sum of Independent Normal Random Variables is
Normal. Let X and Y be independent normal random variables with means $\mu_x$,
$\mu_y$, and variances $\sigma_x^2$, $\sigma_y^2$, respectively. Let W = X + Y. Then,

\[ M_X(s) = e^{\frac{\sigma_x^2 s^2}{2} + \mu_x s}, \qquad M_Y(s) = e^{\frac{\sigma_y^2 s^2}{2} + \mu_y s}, \]

and

\[ M_W(s) = e^{\frac{(\sigma_x^2 + \sigma_y^2) s^2}{2} + (\mu_x + \mu_y) s}. \]

Thus, W has the same transform as a normal random variable with mean $\mu_x + \mu_y$
and variance $\sigma_x^2 + \sigma_y^2$. By the uniqueness property of transforms, W is normal with
these parameters.


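Summary of Transforms and their Properties


• The transform associated with the distribution of a random variable
X is given by

\[ M_X(s) = E[e^{sX}] = \begin{cases} \displaystyle\sum_x e^{sx} p_X(x), & X \text{ discrete}, \\ \displaystyle\int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx, & X \text{ continuous}. \end{cases} \]

• The distribution of a random variable is completely determined by the
corresponding transform.

• Moment generating properties:

\[ M_X(0) = 1, \qquad \left. \frac{d}{ds} M_X(s) \right|_{s=0} = E[X], \qquad \left. \frac{d^n}{ds^n} M_X(s) \right|_{s=0} = E[X^n]. \]

• If Y = aX + b, then $M_Y(s) = e^{sb} M_X(as)$.
• If X and Y are independent, then $M_{X+Y}(s) = M_X(s) M_Y(s)$.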


We have derived formulas for the transforms of a few common random 
variables. Such formulas can be derived with a moderate amount of algebra for 
many other distributions. Some of the most useful ones are summarized in the 
tables that follow. 


Transforms for Common Discrete Random Variables

Bernoulli(p)

\[ p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0, \end{cases} \qquad M_X(s) = 1 - p + p e^s. \]

Geometric(p)

\[ p_X(k) = p(1-p)^{k-1}, \quad k = 1, 2, \ldots, \qquad M_X(s) = \frac{p e^s}{1 - (1-p) e^s}. \]

Poisson($\lambda$)

\[ p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, \ldots, \qquad M_X(s) = e^{\lambda(e^s - 1)}. \]

Uniform(a, b)

\[ p_X(k) = \frac{1}{b - a + 1}, \quad k = a, a+1, \ldots, b, \qquad M_X(s) = \frac{e^{as}}{b - a + 1} \cdot \frac{e^{(b-a+1)s} - 1}{e^s - 1}. \]


Transforms for Common Continuous Random Variables

Uniform(a, b)

\[ f_X(x) = \frac{1}{b - a}, \quad a \le x \le b, \qquad M_X(s) = \frac{1}{b - a} \cdot \frac{e^{sb} - e^{sa}}{s}. \]

Exponential($\lambda$)

\[ f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0, \qquad M_X(s) = \frac{\lambda}{\lambda - s}, \quad (s < \lambda). \]

Normal($\mu, \sigma^2$)

\[ f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty, \qquad M_X(s) = e^{\frac{\sigma^2 s^2}{2} + \mu s}. \]


Transforms of Joint Distributions


If two random variables X and Y are described by some joint distribution (e.g., a
joint PDF), then each one is associated with a transform $M_X(s)$ or $M_Y(s)$. These
are the transforms of the marginal distributions and do not convey information on
the dependence between the two random variables. Such information is contained
in a multivariate transform, which we now define.

Consider n random variables $X_1, \ldots, X_n$ related to the same experiment.
Let $s_1, \ldots, s_n$ be scalar free parameters. The associated multivariate transform
is a function of these n parameters and is defined by

\[ M_{X_1, \ldots, X_n}(s_1, \ldots, s_n) = E\left[ e^{s_1 X_1 + \cdots + s_n X_n} \right]. \]

The inversion property of transforms discussed earlier extends to the multivariate
case. That is, if $Y_1, \ldots, Y_n$ is another set of random variables and
$M_{X_1, \ldots, X_n}(s_1, \ldots, s_n)$, $M_{Y_1, \ldots, Y_n}(s_1, \ldots, s_n)$ are the same functions of $s_1, \ldots, s_n$,
then the joint distribution of $X_1, \ldots, X_n$ is the same as the joint distribution of
$Y_1, \ldots, Y_n$.


4.2 SUMS OF INDEPENDENT RANDOM VARIABLES - CONVOLUTIONS


If X and Y are independent random variables, the distribution of their sum
W = X + Y can be obtained by computing and then inverting the transform
$M_W(s) = M_X(s) M_Y(s)$. But it can also be obtained directly, using the method
developed in this section.


The Discrete Case 


Let W = X + Y, where X and Y are independent integer-valued random variables
with PMFs $p_X(x)$ and $p_Y(y)$. Then, for any integer w,

\[ \begin{aligned} p_W(w) &= P(X + Y = w) \\ &= \sum_{(x,y):\, x+y=w} P(X = x \text{ and } Y = y) \\ &= \sum_x P(X = x \text{ and } Y = w - x) \\ &= \sum_x p_X(x) p_Y(w - x). \end{aligned} \]

Figure 4.2: The probability $p_W(3)$ that X + Y = 3 is the sum of the probabilities
of all pairs (x, y) such that x + y = 3, which are the points indicated in the
figure. The probability of a generic such point is of the form $p_{X,Y}(x, 3 - x) = p_X(x) p_Y(3 - x)$.


The resulting PMF $p_W(w)$ is called the convolution of the PMFs of X and Y.
See Fig. 4.2 for an illustration.


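Example 4.13. Let X and Y be independent and have PMFs given by

\[ p_X(x) = \begin{cases} \frac{1}{3}, & \text{if } x = 1, 2, 3, \\ 0, & \text{otherwise}, \end{cases} \qquad p_Y(y) = \begin{cases} \frac{1}{2}, & \text{if } y = 0, \\ \frac{1}{3}, & \text{if } y = 1, \\ \frac{1}{6}, & \text{if } y = 2, \\ 0, & \text{otherwise}. \end{cases} \]

To calculate the PMF of W = X + Y by convolution, we first note that the
possible values of w are the integers in the range [1, 5]. Thus we have

\[ p_W(w) = 0 \qquad \text{if } w \ne 1, 2, 3, 4, 5. \]

We calculate $p_W(w)$ for each of the values w = 1, 2, 3, 4, 5 using the convolution
formula. We have

\[ p_W(1) = \sum_x p_X(x) p_Y(1 - x) = p_X(1) \cdot p_Y(0) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}, \]

where the second equality above is based on the fact that for $x \ne 1$ either $p_X(x)$ or
$p_Y(1 - x)$ (or both) is zero. Similarly, we obtain

\[ \begin{aligned} p_W(2) &= p_X(1) \cdot p_Y(1) + p_X(2) \cdot p_Y(0) = \frac{1}{3} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{1}{2} = \frac{5}{18}, \\ p_W(3) &= p_X(1) \cdot p_Y(2) + p_X(2) \cdot p_Y(1) + p_X(3) \cdot p_Y(0) = \frac{1}{3} \cdot \frac{1}{6} + \frac{1}{3} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{3}, \\ p_W(4) &= p_X(2) \cdot p_Y(2) + p_X(3) \cdot p_Y(1) = \frac{1}{3} \cdot \frac{1}{6} + \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{6}, \\ p_W(5) &= p_X(3) \cdot p_Y(2) = \frac{1}{3} \cdot \frac{1}{6} = \frac{1}{18}. \end{aligned} \]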
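The discrete convolution formula is easy to carry out mechanically. The sketch below uses exact fractions and loops over all pairs (x, y), accumulating $p_X(x) p_Y(y)$ at w = x + y. The PMF values are the ones assumed for Example 4.13 (X uniform on {1, 2, 3}, and Y taking the values 0, 1, 2 with probabilities 1/2, 1/3, 1/6); the function name `convolve` is a choice made here.

```python
from fractions import Fraction as F

def convolve(px, py):
    """Return the PMF of W = X + Y for independent X, Y given as {value: probability}."""
    pw = {}
    for x, p in px.items():
        for y, q in py.items():
            pw[x + y] = pw.get(x + y, F(0)) + p * q
    return pw

# PMFs assumed for Example 4.13.
px = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}
py = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}

pw = convolve(px, py)
for w in sorted(pw):
    print(w, pw[w])
```

Looping over pairs and accumulating at x + y is equivalent to the sum over x of $p_X(x) p_Y(w - x)$, but avoids having to work out the range of x for each w.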
The Continuous Case


Let X and Y be independent continuous random variables with PDFs $f_X(x)$ and
$f_Y(y)$. We wish to find the PDF of W = X + Y. Since W is a function of two
random variables X and Y, we can follow the method of Chapter 3, and start
by deriving the CDF $F_W(w)$ of W. We have

\[ \begin{aligned} F_W(w) &= P(W \le w) \\ &= P(X + Y \le w) \\ &= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{w-x} f_X(x) f_Y(y)\,dy\,dx \\ &= \int_{x=-\infty}^{\infty} f_X(x) \left[ \int_{y=-\infty}^{w-x} f_Y(y)\,dy \right] dx \\ &= \int_{x=-\infty}^{\infty} f_X(x) F_Y(w - x)\,dx. \end{aligned} \]

The PDF of W is then obtained by differentiating the CDF:

\[ f_W(w) = \frac{dF_W}{dw}(w) = \int_{x=-\infty}^{\infty} f_X(x) \frac{dF_Y}{dw}(w - x)\,dx = \int_{x=-\infty}^{\infty} f_X(x) f_Y(w - x)\,dx. \]

This formula is entirely analogous to the formula for the discrete case, except
that the summation is replaced by an integral and the PMFs are replaced by
PDFs. For an intuitive understanding of this formula, see Fig. 4.3.


Example 4.14. The random variables X and Y are independent and uniformly
distributed in the interval [0, 1]. The PDF of W = X + Y is

\[ f_W(w) = \int_{-\infty}^{\infty} f_X(x) f_Y(w - x)\,dx. \]

The integrand $f_X(x) f_Y(w - x)$ is nonzero (and equal to 1) for $0 \le x \le 1$ and
$0 \le w - x \le 1$. Combining these two inequalities, the integrand is nonzero for
$\max\{0, w - 1\} \le x \le \min\{1, w\}$. Thus,

\[ f_W(w) = \begin{cases} \min\{1, w\} - \max\{0, w - 1\}, & 0 \le w \le 2, \\ 0, & \text{otherwise}, \end{cases} \]

Figure 4.3: Illustration of the convolution formula for the case of continuous
random variables (compare with Fig. 4.2). For small $\delta$, the probability of the
strip indicated in the figure is $P(w \le X + Y \le w + \delta) \approx f_W(w) \cdot \delta$. Thus,

\[ \begin{aligned} f_W(w) \cdot \delta &= P(w \le X + Y \le w + \delta) \\ &= \int_{x=-\infty}^{\infty} \int_{y=w-x}^{w-x+\delta} f_X(x) f_Y(y)\,dy\,dx \\ &\approx \int_{x=-\infty}^{\infty} f_X(x) f_Y(w - x) \delta\,dx. \end{aligned} \]

The desired formula follows by canceling $\delta$ from both sides.


Figure 4.4: The PDF of the sum of two independent uniform random variables
in [0, 1].


which has the triangular shape shown in Fig. 4.4. 
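The triangular shape can be confirmed numerically. The sketch below approximates the convolution integral $\int f_X(x) f_Y(w - x)\,dx$ by a midpoint rule for two uniform PDFs on [0, 1] and compares it with the closed form $\min\{1, w\} - \max\{0, w - 1\}$. The integration range, the step count, and the function names are assumptions made for illustration; the tolerance is loose because the integrand has jumps.

```python
def uniform_pdf(t):
    """PDF of a uniform random variable on [0, 1]."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def triangle_pdf(w):
    """Closed-form PDF of the sum of two independent uniforms on [0, 1]."""
    if 0.0 <= w <= 2.0:
        return min(1.0, w) - max(0.0, w - 1.0)
    return 0.0

def convolution_pdf(fx, fy, w, lo=-1.0, hi=3.0, steps=40_000):
    """Midpoint-rule approximation of the integral of fx(x) * fy(w - x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += fx(x) * fy(w - x)
    return total * dx

for w in (0.25, 1.0, 1.5, 2.5):
    print(w, convolution_pdf(uniform_pdf, uniform_pdf, w), triangle_pdf(w))
```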


The calculation in the last example was based on a literal application of the 
convolution formula. The most delicate step was to determine the correct limits 
for the integration. This is often tedious and error prone, but can be bypassed 
using a graphical method described next. 
Graphical Calculation of Convolutions


We will use a dummy variable t as the argument of the different functions
involved in this discussion; see also Fig. 4.5. Consider a PDF $f_X(t)$ which is zero
outside the range $a \le t \le b$ and a PDF $f_Y(t)$ which is zero outside the range
$c \le t \le d$. Let us fix a value w, and plot $f_Y(w - t)$ as a function of t. This plot
has the same shape as the plot of $f_Y(t)$ except that it is first "flipped" and then
shifted by an amount w. (If w > 0, this is a shift to the right, if w < 0, this is a
shift to the left.) We then place the plots of $f_X(t)$ and $f_Y(w - t)$ on top of each
other. The value of $f_W(w)$ is equal to the integral of the product of these two
plots. By varying the amount w by which we are shifting, we obtain $f_W(w)$ for
any w.


Figure 4.5: Illustration of the convolution calculation. For the value of w under
consideration, $f_W(w)$ is equal to the integral of the function shown in the last
plot.


4.3 CONDITIONAL EXPECTATION AS A RANDOM VARIABLE 


The value of the conditional expectation E[X | Y = y] of a random variable X
given another random variable Y depends on the realized experimental value y
of Y. This makes E[X | Y] a function of Y, and therefore a random variable. In
this section, we study the expectation and variance of E[X | Y]. In the process,
we obtain some useful formulas (the law of iterated expectations and the
law of conditional variances) that are often convenient for the calculation of
expected values and variances.

Recall that the conditional expectation E[X | Y = y] is defined by

\[ E[X \mid Y = y] = \sum_x x\, p_{X|Y}(x \mid y), \qquad \text{(discrete case)}, \]

and

\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx, \qquad \text{(continuous case)}. \]

Once a value of y is given, the above summation or integration yields a numerical
value for E[X | Y = y].


Example 4.15. Let the random variables X and Y have a joint PDF which
is equal to 2 for (x, y) belonging to the triangle indicated in Fig. 4.6(a), and zero
everywhere else. In order to compute E[X | Y = y], we first need to obtain the
conditional density of X given Y = y.


Figure 4.6: (a) The joint PDF in Example 4.15. (b) The conditional density
of X.


We have

\[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \int_0^{1-y} 2\,dx = 2(1 - y), \qquad 0 \le y \le 1, \]

and

\[ f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{1}{1 - y}, \qquad 0 \le x \le 1 - y. \]

The conditional density is shown in Fig. 4.6(b).
Intuitively, since the joint PDF is constant, the conditional PDF (which is a
“slice” of the joint, at some fixed y) is also a constant. Therefore, the conditional 
PDF must be a uniform distribution. Given that Y = y, X ranges from 0 to 1— y. 
Therefore, for the PDF to integrate to 1, its height must be equal to 1/(1 — y), in 
agreement with Fig. 4.6(b). 

For y > 1 or y < 0, the conditional PDF is undefined, since these values of 
y are impossible. For y = 1, X must be equal to 0, with certainty, and E[X |Y = 
1] =0. 

For 0 < y < 1, the conditional mean E[X | Y = y] is the expectation of the 
uniform PDF in Fig. 4.6(b), and we have 


\[ E[X \mid Y = y] = \frac{1 - y}{2}, \qquad 0 \le y < 1. \]

Since E[X | Y = 1] = 0, the above formula is also valid when y = 1. The conditional
expectation is undefined when y is outside [0, 1].


For any number y, E[X |Y = y] is also a number. As y varies, so does 
E[X | Y = y], and we can therefore view E[X | Y = y] as a function of y. Since 
y is the experimental value of the random variable Y, we are dealing with a 
function of a random variable, hence a new random variable. More precisely, we 
define E[X | Y] to be the random variable whose value is E[X | Y = y] when the 
outcome of Y is y. 


Example 4.15. (continued) We saw that E[X | Y = y] = (1 - y)/2. Hence,
E[X | Y] is the random variable (1 - Y)/2:

\[ E[X \mid Y] = \frac{1 - Y}{2}. \]

Since E[X | Y] is a random variable, it has an expectation E[E[X | Y]] of
its own. Applying the expected value rule, this is given by

\[ E\big[ E[X \mid Y] \big] = \begin{cases} \displaystyle\sum_y E[X \mid Y = y]\, p_Y(y), & Y \text{ discrete}, \\ \displaystyle\int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\,dy, & Y \text{ continuous}. \end{cases} \]


Both expressions in the right-hand side should be familiar from Chapters 2 and 
3, respectively. By the corresponding versions of the total expectation theorem, 
they are equal to E[X]. This brings us to the following conclusion, which is 
actually valid for every type of random variable Y (discrete, continuous, mixed, 
etc.), as long as X has a well-defined and finite expectation E[X]. 


Law of iterated expectations: E[E[X | Y]] = E[X].


Example 4.15 (continued) In Example 4.15, we found E[X | Y] = (1 - Y)/2
[see Fig. 4.6(b)]. Taking expectations of both sides, and using the law of iterated
expectations to evaluate the left-hand side, we obtain E[X] = (1 - E[Y])/2. Because
of symmetry, we must have E[X] = E[Y]. Therefore, E[X] = (1 - E[X])/2, which
yields E[X] = 1/3. In a slightly different version of this example, where there is no
symmetry between X and Y, we would use a similar argument to express E[Y].


Example 4.16. We start with a stick of length $\ell$. We break it at a point which
is chosen randomly and uniformly over its length, and keep the piece that contains
the left end of the stick. We then repeat the same process on the stick that we
were left with. What is the expected length of the stick that we are left with, after
breaking twice?

Let Y be the length of the stick after we break for the first time. Let X be
the length after the second time. We have E[X | Y] = Y/2, since the breakpoint is
chosen uniformly over the length Y of the remaining stick. For a similar reason, we
also have $E[Y] = \ell/2$. Thus,

\[ E[X] = E\big[ E[X \mid Y] \big] = E\left[ \frac{Y}{2} \right] = \frac{E[Y]}{2} = \frac{\ell}{4}. \]

Example 4.17. Averaging Quiz Scores by Section. A class has n students
and the quiz score of student i is $x_i$. The average quiz score is

\[ m = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

The class consists of S sections, with $n_s$ students in section s. The average score
in section s is

\[ m_s = \frac{1}{n_s} \sum_{\text{stdnts. } i \text{ in sec. } s} x_i. \]

The average score over the whole class can be computed by taking the average score
$m_s$ of each section, and then forming a weighted average; the weight given to section
s is proportional to the number of students in that section, and is $n_s/n$. We verify
that this gives the correct result:

\[ \begin{aligned} \sum_{s=1}^{S} \frac{n_s}{n} m_s &= \sum_{s=1}^{S} \frac{n_s}{n} \cdot \frac{1}{n_s} \sum_{\text{stdnts. } i \text{ in sec. } s} x_i \\ &= \frac{1}{n} \sum_{s=1}^{S} \sum_{\text{stdnts. } i \text{ in sec. } s} x_i \\ &= \frac{1}{n} \sum_{i=1}^{n} x_i \\ &= m. \end{aligned} \]
How is this related to conditional expectations? Consider an experiment in
which a student is selected at random, each student having probability 1/n of being
selected. Consider the following two random variables:

X = quiz score of a student,
Y = section of a student, ($Y \in \{1, \ldots, S\}$).

We then have

\[ E[X] = m. \]

Conditioning on Y = s is the same as assuming that the selected student is
in section s. Conditional on that event, every student in that section has the same
probability $1/n_s$ of being chosen. Therefore,

\[ E[X \mid Y = s] = \frac{1}{n_s} \sum_{\text{stdnts. } i \text{ in sec. } s} x_i = m_s. \]

A randomly selected student belongs to section s with probability $n_s/n$, i.e.,
$P(Y = s) = n_s/n$. Hence,

\[ E\big[ E[X \mid Y] \big] = \sum_{s=1}^{S} E[X \mid Y = s] P(Y = s) = \sum_{s=1}^{S} \frac{n_s}{n} m_s. \]


As shown earlier, this is the same as m. Thus, averaging by section can be viewed 
as a special case of the law of iterated expectations. 
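The averaging-by-section argument can be replayed on a small data set. The sketch below uses made-up scores for three sections (purely hypothetical data, chosen here only for illustration) and checks with exact fractions that the weighted average of the section means equals the overall mean.

```python
from fractions import Fraction as F

# Hypothetical quiz scores grouped by section (illustrative data only).
sections = {1: [70, 80, 90], 2: [60, 100], 3: [85]}

def mean(values):
    """Exact average of a list of numbers."""
    return F(sum(values), len(values))

n = sum(len(scores) for scores in sections.values())

# E[X]: average over all students.
overall_mean = mean([x for scores in sections.values() for x in scores])

# E[X | Y = s]: average within each section.
section_means = {s: mean(scores) for s, scores in sections.items()}

# E[E[X | Y]]: section means weighted by P(Y = s) = n_s / n.
iterated = sum(F(len(scores), n) * section_means[s] for s, scores in sections.items())

print(overall_mean, iterated)
```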


Example 4.18. Forecast Revisions. Let Y be the sales of a company in the 
first semester of the coming year, and let X be the sales over the entire year. The 
company has constructed a statistical model of sales, and so the joint distribution of 
X and Y is assumed to be known. In the beginning of the year, the expected value 
E[X] serves as a forecast of the actual sales X. In the middle of the year, the first 
semester sales have been realized and the experimental value of the random variable Y
is now known. This places us in a new “universe,” where everything is conditioned 
on the realized value of Y. We then consider the mid-year revised forecast of yearly 
sales, which is E[X | Y]. 

We view E[X | Y] - E[X] as the forecast revision, in light of the mid-year
information. The law of iterated expectations implies that

\[ E\big[ E[X \mid Y] - E[X] \big] = 0. \]


This means that, in the beginning of the year, we do not expect our forecast to 
be revised in any specific direction. Of course, the actual revision will usually be 
positive or negative, but the probabilities are such that it is zero on the average. 
This is quite intuitive. For example, if a positive revision was expected, the original 
forecast should have been higher in the first place. 
The Conditional Variance


The conditional distribution of X given Y = y has a mean, which is E[X | Y = y],
and by the same token, it also has a variance. This is defined by the same formula
as the unconditional variance, except that everything is conditioned on Y = y:

\[ \mathrm{var}(X \mid Y = y) = E\Big[ \big( X - E[X \mid Y = y] \big)^2 \,\Big|\, Y = y \Big]. \]

Note that the conditional variance is a function of the experimental value y of
the random variable Y. Hence, it is a function of a random variable, and is itself
a random variable that will be denoted by var(X | Y).

Arguing by analogy to the law of iterated expectations, we may conjecture
that the expectation of the conditional variance var(X | Y) is related to the
unconditional variance var(X). This is indeed the case, but the relation is more
complex.


Law of Conditional Variances:

\[ \mathrm{var}(X) = E\big[ \mathrm{var}(X \mid Y) \big] + \mathrm{var}\big( E[X \mid Y] \big) \]


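To verify the law of conditional variances, we start with the identity

\[ X - E[X] = \big( X - E[X \mid Y] \big) + \big( E[X \mid Y] - E[X] \big). \]

We square both sides and then take expectations to obtain

\[ \begin{aligned} \mathrm{var}(X) &= E\Big[ \big( X - E[X] \big)^2 \Big] \\ &= E\Big[ \big( X - E[X \mid Y] \big)^2 \Big] + E\Big[ \big( E[X \mid Y] - E[X] \big)^2 \Big] \\ &\quad + 2 E\Big[ \big( X - E[X \mid Y] \big) \big( E[X \mid Y] - E[X] \big) \Big]. \end{aligned} \]

Using the law of iterated expectations, the first term in the right-hand side of
the above equation can be written as

\[ E\Big[ E\Big[ \big( X - E[X \mid Y] \big)^2 \,\Big|\, Y \Big] \Big], \]

which is the same as E[var(X | Y)]. The second term is equal to var(E[X | Y]),
since E[X] is the mean of E[X | Y]. Finally, the third term is zero, as we now
show. Indeed, if we define h(Y) = 2(E[X | Y] - E[X]), the third term is

\[ \begin{aligned} E\Big[ \big( X - E[X \mid Y] \big) h(Y) \Big] &= E\big[ X h(Y) \big] - E\big[ E[X \mid Y] h(Y) \big] \\ &= E\big[ X h(Y) \big] - E\Big[ E\big[ X h(Y) \mid Y \big] \Big] \\ &= E\big[ X h(Y) \big] - E\big[ X h(Y) \big] \\ &= 0. \end{aligned} \]
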
Example 4.16. (continued) Consider again the problem where we break twice
a stick of length $\ell$, at randomly chosen points, with Y being the length of the stick
after the first break and X being the length after the second break. We calculated
the mean of X as $\ell/4$, and now let us use the law of conditional variances to calculate
var(X). We have E[X | Y] = Y/2, so since Y is uniformly distributed between 0
and $\ell$,

\[ \mathrm{var}\big( E[X \mid Y] \big) = \mathrm{var}(Y/2) = \frac{1}{4} \mathrm{var}(Y) = \frac{1}{4} \cdot \frac{\ell^2}{12} = \frac{\ell^2}{48}. \]

Also, since X is uniformly distributed between 0 and Y, we have

\[ \mathrm{var}(X \mid Y) = \frac{Y^2}{12}. \]

Thus, since Y is uniformly distributed between 0 and $\ell$,

\[ E\big[ \mathrm{var}(X \mid Y) \big] = \frac{1}{\ell} \int_0^{\ell} \frac{y^2}{12}\,dy = \frac{1}{12} \cdot \frac{1}{3\ell} \, y^3 \Big|_0^{\ell} = \frac{\ell^2}{36}. \]

Using now the law of conditional variances, we obtain

\[ \mathrm{var}(X) = E\big[ \mathrm{var}(X \mid Y) \big] + \mathrm{var}\big( E[X \mid Y] \big) = \frac{\ell^2}{36} + \frac{\ell^2}{48} = \frac{7 \ell^2}{144}. \]
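A simulation gives a rough check of both the mean $\ell/4$ and the variance $7\ell^2/144$ for the stick-breaking example. The sketch below is a Monte Carlo estimate under the assumption $\ell = 1$; the seed, the number of trials, and the function names are arbitrary choices made here, and the estimates are only approximate.

```python
import random

def break_twice(length, rng):
    """Break a stick of the given length twice, keeping the left piece each time."""
    y = rng.uniform(0.0, length)  # length after the first break
    x = rng.uniform(0.0, y)       # length after the second break
    return x

def simulate(length=1.0, trials=200_000, seed=0):
    """Return the sample mean and sample variance of the final length."""
    rng = random.Random(seed)
    samples = [break_twice(length, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / trials
    return mean, var

sample_mean, sample_var = simulate()
print(sample_mean, 1 / 4)    # sample mean versus l/4 with l = 1
print(sample_var, 7 / 144)   # sample variance versus 7 l^2 / 144 with l = 1
```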


Example 4.19. Averaging Quiz Scores by Section - Variance. The setting
is the same as in Example 4.17 and we consider the random variables

X = quiz score of a student,
Y = section of a student, ($Y \in \{1, \ldots, S\}$).

Let $n_s$ be the number of students in section s, and let n be the total number of
students. We interpret the different quantities in the formula

\[ \mathrm{var}(X) = E\big[ \mathrm{var}(X \mid Y) \big] + \mathrm{var}\big( E[X \mid Y] \big). \]

In this context, var(X | Y = s) is the variance of the quiz scores within
section s. Then, E[var(X | Y)] is the average of the section variances. This latter
expectation is an average over the probability distribution of Y, i.e.,

\[ E\big[ \mathrm{var}(X \mid Y) \big] = \sum_{s=1}^{S} P(Y = s)\, \mathrm{var}(X \mid Y = s) = \sum_{s=1}^{S} \frac{n_s}{n} \mathrm{var}(X \mid Y = s). \]

Recall that E[X | Y = s] is the average score in section s. Then, var(E[X | Y])
is a measure of the variability of the averages of the different sections. The law of
conditional variances states that the total quiz score variance can be broken into
two parts:

(a) The average score variability E[var(X | Y)] within individual sections.


(b) The variability var(E[X | Y]) between sections.
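The within/between decomposition can be verified exactly on a small data set. The sketch below uses made-up scores for three sections (hypothetical data, for illustration only) and checks with exact fractions that the total variance equals the average within-section variance plus the variance of the section means.

```python
from fractions import Fraction as F

# Hypothetical quiz scores grouped by section (illustrative data only).
sections = {1: [70, 80, 90], 2: [60, 100], 3: [85]}

def mean(values):
    """Exact average of a list of numbers."""
    return F(sum(values), len(values))

def variance(values):
    """Exact population variance: the average squared deviation from the mean."""
    m = mean(values)
    return mean([(x - m) ** 2 for x in values])

all_scores = [x for scores in sections.values() for x in scores]
n = len(all_scores)
m = mean(all_scores)

total_var = variance(all_scores)  # var(X)
within = sum(F(len(sc), n) * variance(sc) for sc in sections.values())         # E[var(X | Y)]
between = sum(F(len(sc), n) * (mean(sc) - m) ** 2 for sc in sections.values())  # var(E[X | Y])

print(total_var, within, between)
```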


We have seen earlier that the law of iterated expectations (in the form of the 
total expectation theorem) can be used to break down complicated expectation 
calculations, by considering different cases. A similar method applies to variance 
calculations. 


Example 4.20. Computing Variances by Conditioning. Consider a continuous
random variable X with the PDF given in Fig. 4.7. We define an auxiliary
random variable Y as follows:

$$Y = \begin{cases} 1, & \text{if } x < 1, \\ 2, & \text{if } x \ge 1. \end{cases}$$

Here, E[X | Y] takes the values 1/2 and 3/2, with probabilities 1/3 and 2/3, respectively.
Thus, the mean of E[X | Y] is 7/6. Therefore,

$$\mathrm{var}\big(E[X \mid Y]\big) = \frac{1}{3}\Big(\frac{1}{2} - \frac{7}{6}\Big)^2 + \frac{2}{3}\Big(\frac{3}{2} - \frac{7}{6}\Big)^2 = \frac{2}{9}.$$

Figure 4.7: The PDF in Example 4.20.

Conditioned on either value of Y, X is uniformly distributed on a unit length
interval. Therefore, var(X | Y = y) = 1/12 for each of the two possible values of y,
and E[var(X | Y)] = 1/12. Putting everything together, we obtain

$$\mathrm{var}(X) = E\big[\mathrm{var}(X \mid Y)\big] + \mathrm{var}\big(E[X \mid Y]\big) = \frac{1}{12} + \frac{2}{9} = \frac{11}{36}.$$

We summarize the main points in this section.


The Mean and Variance of a Conditional Expectation

• E[X | Y = y] is a number, whose value depends on y.

• E[X | Y] is a function of the random variable Y, hence a random variable.
Its experimental value is E[X | Y = y] whenever the experimental
value of Y is y.

• E[E[X | Y]] = E[X] (law of iterated expectations).

• var(X | Y) is a random variable whose experimental value is var(X | Y = y),
whenever the experimental value of Y is y.

• var(X) = E[var(X | Y)] + var(E[X | Y]).
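The last identity in this summary can be verified exactly on the numbers of Example 4.20. In the sketch below, the helper law_of_conditional_variances and its argument names are illustrative; the independent check uses the standard fact that E[U²] = (a² + ab + b²)/3 for U uniform on [a, b].

```python
from fractions import Fraction as F

def law_of_conditional_variances(p_y, cond_mean, cond_var):
    """Return (E[var(X|Y)], var(E[X|Y])) for a discrete conditioning variable Y."""
    mean = sum(p_y[y] * cond_mean[y] for y in p_y)                    # E[E[X|Y]] = E[X]
    avg_var = sum(p_y[y] * cond_var[y] for y in p_y)                  # E[var(X|Y)]
    var_mean = sum(p_y[y] * (cond_mean[y] - mean) ** 2 for y in p_y)  # var(E[X|Y])
    return avg_var, var_mean

# Example 4.20: X is uniform on [0,1] given Y = 1, and uniform on [1,2] given Y = 2.
p_y = {1: F(1, 3), 2: F(2, 3)}
cond_mean = {1: F(1, 2), 2: F(3, 2)}
cond_var = {1: F(1, 12), 2: F(1, 12)}

avg_var, var_mean = law_of_conditional_variances(p_y, cond_mean, cond_var)
total = avg_var + var_mean

# Independent check: var(X) = E[X^2] - (E[X])^2, with E[X^2] obtained by conditioning.
second_moment = F(1, 3) * F(1, 3) + F(2, 3) * F(7, 3)
direct = second_moment - F(7, 6) ** 2
print(avg_var, var_mean, total, direct)   # 1/12 2/9 11/36 11/36
```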


4.4 SUM OF A RANDOM NUMBER OF INDEPENDENT RANDOM 
VARIABLES 


In our discussion so far of sums of random variables, we have always assumed
that the number of variables in the sum is known and fixed, i.e., it is nonrandom.
In this section we will consider the case where the number of random variables
being added is itself random. In particular, we consider the sum

$$Y = X_1 + X_2 + \cdots + X_N,$$

where N is a random variable that takes nonnegative integer values, and X_1, X_2, ...
are identically distributed random variables. We assume that N, X_1, X_2, ... are
independent, meaning that any finite subcollection of these random variables is
independent.

We first note that the randomness of N can significantly affect the character
of the random sum Y = X_1 + ··· + X_N. In particular, the PMF/PDF of the sum
of the X_i over i = 1, ..., N is much different from the PMF/PDF of the sum
of the X_i over i = 1, ..., E[N], where N has been replaced by its expected value
(assuming that E[N] is an integer). For
example, let X_i be uniformly distributed in the interval [0,1], and let N be
equal to 1 or 3 with probability 1/2 each. Then the random sum Y
takes values in the interval [0,3], whereas if we replace N by its expected value
E[N] = 2, the sum Y = X_1 + X_2 takes values in the interval [0,2]. Furthermore,
using the total probability theorem, we see that the PDF of Y is a mixture of
the uniform PDF and the PDF of X_1 + X_2 + X_3, and has considerably different
character than the triangular PDF of Y = X_1 + X_2 which is given in Fig. 4.4.

Let us denote by μ and σ² the common mean and the variance of the X_i.
We wish to derive formulas for the mean, variance, and the transform of Y. The
method that we follow is to first condition on the event N = n, under which we
have the sum of a fixed number of random variables, a case that we
already know how to handle.

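Fix some number n. The random variable X_1 + ··· + X_n is independent of
N and, therefore, independent of the event {N = n}. Hence,

$$\begin{aligned}
E[Y \mid N = n] &= E[X_1 + \cdots + X_N \mid N = n] \\
&= E[X_1 + \cdots + X_n \mid N = n] \\
&= E[X_1 + \cdots + X_n] \\
&= n\mu.
\end{aligned}$$

This is true for every nonnegative integer n and, therefore,

$$E[Y \mid N] = N\mu.$$

Using the law of iterated expectations, we obtain

$$E[Y] = E\big[E[Y \mid N]\big] = E[\mu N] = \mu E[N].$$

Similarly,

$$\begin{aligned}
\mathrm{var}(Y \mid N = n) &= \mathrm{var}(X_1 + \cdots + X_N \mid N = n) \\
&= \mathrm{var}(X_1 + \cdots + X_n) \\
&= n\sigma^2.
\end{aligned}$$

Since this is true for every nonnegative integer n, the random variable var(Y | N)
is equal to Nσ². We now use the law of conditional variances to obtain

$$\begin{aligned}
\mathrm{var}(Y) &= E\big[\mathrm{var}(Y \mid N)\big] + \mathrm{var}\big(E[Y \mid N]\big) \\
&= E[N]\sigma^2 + \mathrm{var}(N\mu) \\
&= E[N]\sigma^2 + \mu^2\,\mathrm{var}(N).
\end{aligned}$$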
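The two formulas just derived are easy to evaluate once the PMF of N and the mean and variance of the X_i are known. The sketch below applies them to the earlier illustration where N equals 1 or 3 with probability 1/2 each and each X_i is uniform on [0,1], so that μ = 1/2 and σ² = 1/12. The function name random_sum_moments and the dictionary representation of the PMF are illustrative choices.

```python
from fractions import Fraction as F

def random_sum_moments(pmf_n, mu, sigma2):
    """Return (E[Y], var(Y)) for Y = X_1 + ... + X_N, given the PMF of N."""
    mean_n = sum(n * p for n, p in pmf_n.items())                  # E[N]
    var_n = sum((n - mean_n) ** 2 * p for n, p in pmf_n.items())   # var(N)
    mean_y = mu * mean_n                                           # E[Y] = mu E[N]
    var_y = mean_n * sigma2 + mu ** 2 * var_n                      # E[N] sigma^2 + mu^2 var(N)
    return mean_y, var_y

pmf_n = {1: F(1, 2), 3: F(1, 2)}     # N is 1 or 3, each with probability 1/2
mean_y, var_y = random_sum_moments(pmf_n, mu=F(1, 2), sigma2=F(1, 12))
print(mean_y, var_y)                 # 1 5/12
```

For comparison, the fixed sum X_1 + X_2 has the same mean 1 but variance 2/12 = 1/6; the extra 1/4 is the term μ² var(N) contributed by the randomness of N.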


The calculation of the transform proceeds along similar lines. The transform
associated with Y, conditional on N = n, is E[e^{sY} | N = n]. However, conditioned
on N = n, Y is the sum of the independent random variables X_1, ..., X_n,
and

$$\begin{aligned}
E[e^{sY} \mid N = n] &= E[e^{sX_1} \cdots e^{sX_N} \mid N = n] = E[e^{sX_1} \cdots e^{sX_n}] \\
&= E[e^{sX_1}] \cdots E[e^{sX_n}] = \big(M_X(s)\big)^n.
\end{aligned}$$

Using the law of iterated expectations, the (unconditional) transform associated
with Y is

$$E[e^{sY}] = E\big[E[e^{sY} \mid N]\big] = E\big[(M_X(s))^N\big].$$

This is similar to the transform M_N(s) = E[e^{sN}] associated with N, except that
e^s is replaced by M_X(s).


Example 4.21. A remote village has three gas stations, and each one of them
is open on any given day with probability 1/2, independently of the others. The
amount of gas available in each gas station is unknown and is uniformly distributed
between 0 and 1000 gallons. We wish to characterize the distribution of the total
amount of gas available at the gas stations that are open.

The number N of open gas stations is a binomial random variable with p =
1/2 and the corresponding transform is

$$M_N(s) = (1 - p + pe^s)^3 = \frac{1}{8}(1 + e^s)^3.$$

The transform M_X(s) associated with the amount of gas available in an open gas
station is

$$M_X(s) = \frac{e^{1000s} - 1}{1000s}.$$

The transform associated with the total amount Y available is the same as M_N(s),
except that each occurrence of e^s is replaced with M_X(s), i.e.,

$$M_Y(s) = \frac{1}{8}\left(1 + \frac{e^{1000s} - 1}{1000s}\right)^3.$$


Example 4.22. Sum of a Geometric Number of Independent Exponential
Random Variables. Jane visits a number of bookstores, looking for Great Expectations.
Any given bookstore carries the book with probability p, independently
of the others. In a typical bookstore visited, Jane spends a random amount of time,
exponentially distributed with parameter λ, until she either finds the book or she
decides that the bookstore does not carry it. Assuming that Jane will keep visiting
bookstores until she buys the book and that the time spent in each is independent
of everything else, we wish to determine the mean, variance, and PDF of the total
time spent in bookstores.

The total number N of bookstores visited is geometrically distributed with parameter
p. Hence, the total time Y spent in bookstores is the sum of a geometrically
distributed number N of independent exponential random variables X_1, X_2, .... We
have

$$E[Y] = E[N]\,E[X] = \frac{1}{p}\cdot\frac{1}{\lambda}.$$

Using the formulas for the variance of geometric and exponential random variables,
we also obtain

$$\mathrm{var}(Y) = E[N]\,\mathrm{var}(X) + (E[X])^2\,\mathrm{var}(N) = \frac{1}{p}\cdot\frac{1}{\lambda^2} + \frac{1}{\lambda^2}\cdot\frac{1-p}{p^2} = \frac{1}{\lambda^2 p^2}.$$

In order to find the transform M_Y(s), let us recall that

$$M_X(s) = \frac{\lambda}{\lambda - s}, \qquad M_N(s) = \frac{pe^s}{1 - (1-p)e^s}.$$

Then, M_Y(s) is found by starting with M_N(s) and replacing each occurrence of e^s
with M_X(s). This yields

$$M_Y(s) = \frac{pM_X(s)}{1 - (1-p)M_X(s)} = \frac{\dfrac{p\lambda}{\lambda - s}}{1 - (1-p)\dfrac{\lambda}{\lambda - s}},$$

which simplifies to

$$M_Y(s) = \frac{p\lambda}{p\lambda - s}.$$

We recognize this as the transform of an exponentially distributed random variable
with parameter pλ, and therefore,

$$f_Y(y) = p\lambda e^{-p\lambda y}, \qquad y \ge 0.$$

This result can be surprising because the sum of a fixed number n of independent
exponential random variables is not exponentially distributed. For example,
if n = 2, the transform associated with the sum is (λ/(λ − s))², which does not
correspond to the exponential distribution.


Example 4.23. Sum of a Geometric Number of Independent Geometric
Random Variables. This example is a discrete counterpart of the preceding one.
We let N be geometrically distributed with parameter p. We also let each random
variable X_i be geometrically distributed with parameter q. We assume that all of
these random variables are independent. Let Y = X_1 + ··· + X_N. We have

$$M_N(s) = \frac{pe^s}{1 - (1-p)e^s}, \qquad M_X(s) = \frac{qe^s}{1 - (1-q)e^s}.$$

To determine M_Y(s), we start with the formula for M_N(s) and replace each occurrence
of e^s with M_X(s). This yields

$$M_Y(s) = \frac{pM_X(s)}{1 - (1-p)M_X(s)},$$

and, after some algebra,

$$M_Y(s) = \frac{pqe^s}{1 - (1-pq)e^s}.$$

We conclude that Y is geometrically distributed, with parameter pq.

Properties of Sums of a Random Number of Independent Random Variables

Let X_1, X_2, ... be random variables with common mean μ and common
variance σ². Let N be a random variable that takes nonnegative integer
values. We assume that all of these random variables are independent, and
consider

$$Y = X_1 + \cdots + X_N.$$

Then,

• E[Y] = μE[N].

• var(Y) = σ²E[N] + μ²var(N).

• The transform M_Y(s) is found by starting with the transform M_N(s)
and replacing each occurrence of e^s with M_X(s).
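The conclusion of Example 4.23 can also be checked without transforms, by computing the PMF of Y from the total probability theorem. The sketch below relies on the standard fact that the sum of n independent geometric random variables with parameter q has the Pascal PMF C(k−1, n−1) q^n (1−q)^(k−n); the function name and the particular values of p and q are illustrative.

```python
from fractions import Fraction as F
from math import comb

def pmf_geometric_sum(k, p, q):
    """P(Y = k) for Y = X_1 + ... + X_N, with N geometric(p) and X_i geometric(q)."""
    total = F(0)
    for n in range(1, k + 1):                      # Y = k requires 1 <= N <= k
        p_n = p * (1 - p) ** (n - 1)               # P(N = n)
        p_sum = comb(k - 1, n - 1) * q ** n * (1 - q) ** (k - n)  # P(X_1+...+X_n = k)
        total += p_n * p_sum
    return total

p, q = F(1, 3), F(1, 4)                            # arbitrary illustrative parameters
for k in range(1, 11):
    geometric_pq = p * q * (1 - p * q) ** (k - 1)  # geometric PMF with parameter pq
    assert pmf_geometric_sum(k, p, q) == geometric_pq
print("PMF of Y matches the geometric PMF with parameter", p * q)
```

The agreement is exact here because the computation uses rational arithmetic rather than floating point.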


4.5 COVARIANCE AND CORRELATION 


The covariance of two random variables X and Y is denoted by cov(X, Y), and
is defined by

$$\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].$$


When cov(X, Y) = 0, we say that X and Y are uncorrelated. 

Roughly speaking, a positive or negative covariance indicates that the values
of X − E[X] and Y − E[Y] obtained in a single experiment “tend” to have
the same or the opposite sign, respectively (see Fig. 4.8). Thus the sign of the
covariance provides an important qualitative indicator of the relation between
X and Y.

If X and Y are independent, then

$$\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E\big[X - E[X]\big]\,E\big[Y - E[Y]\big] = 0.$$


Thus if X and Y are independent, they are also uncorrelated. However, the 
reverse is not true, as illustrated by the following example. 


Example 4.24. The pair of random variables (X, Y) takes the values (1,0), (0,1),
(−1,0), and (0,−1), each with probability 1/4 (see Fig. 4.9). Thus, the marginal
PMFs of X and Y are symmetric around 0, and E[X] = E[Y] = 0. Furthermore,
for all possible value pairs (x, y), either x or y is equal to 0, which implies that
XY = 0 and E[XY] = 0. Therefore,

$$\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] = 0,$$

and X and Y are uncorrelated. However, X and Y are not independent since, for
example, a nonzero value of X fixes the value of Y to zero.

Figure 4.8: Examples of positively and negatively correlated random variables.
Here X and Y are uniformly distributed over the ellipses shown. In case (a) the
covariance cov(X, Y) is negative, while in case (b) it is positive.

Figure 4.9: Joint PMF of X and Y for Example 4.24. Each of the four
points shown has probability 1/4. Here X and Y are uncorrelated but not
independent.
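The two claims of Example 4.24 (zero covariance, yet dependence) can be confirmed by direct enumeration of the joint PMF. The sketch below uses the identity cov(X, Y) = E[XY] − E[X]E[Y]; the variable names are illustrative.

```python
from fractions import Fraction as F

# Joint PMF of Example 4.24: four points, each with probability 1/4.
pmf = {(1, 0): F(1, 4), (0, 1): F(1, 4), (-1, 0): F(1, 4), (0, -1): F(1, 4)}

e_x = sum(x * p for (x, y), p in pmf.items())
e_y = sum(y * p for (x, y), p in pmf.items())
e_xy = sum(x * y * p for (x, y), p in pmf.items())
cov = e_xy - e_x * e_y                 # cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X = 1, Y = 0) = P(X = 1) P(Y = 0).
p_x1 = sum(p for (x, y), p in pmf.items() if x == 1)   # P(X = 1)
p_y0 = sum(p for (x, y), p in pmf.items() if y == 0)   # P(Y = 0)
p_joint = pmf[(1, 0)]                                  # P(X = 1, Y = 0)

print(cov, p_joint, p_x1 * p_y0)       # 0 1/4 1/8
```

The covariance is zero, but the joint probability 1/4 differs from the product of the marginals 1/8, so X and Y are not independent.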


The correlation coefficient ρ of two random variables X and Y that have
nonzero variances is defined as

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}.$$

It may be viewed as a normalized version of the covariance cov(X, Y), and in fact
it can be shown that ρ ranges from −1 to 1 (see the end-of-chapter problems).

If ρ > 0 (or ρ < 0), then the values of x − E[X] and y − E[Y] “tend”
to have the same (or opposite, respectively) sign, and the size of |ρ| provides a
normalized measure of the extent to which this is true. In fact, always assuming
that X and Y have positive variances, it can be shown that ρ = 1 (or ρ = −1)
if and only if there exists a positive (or negative, respectively) constant c such
that

$$y - E[Y] = c\,(x - E[X]), \qquad \text{for all possible numerical values } (x, y)$$

(see the end-of-chapter problems). The following example illustrates in part this
property.


Example 4.25. Consider n independent tosses of a biased coin with probability of
a head equal to p. Let X and Y be the numbers of heads and of tails, respectively,
and let us look at the correlation of X and Y. Here, for all possible pairs of values
(x, y), we have x + y = n, and we also have E[X] + E[Y] = n. Thus,

$$x - E[X] = -(y - E[Y]), \qquad \text{for all possible } (x, y).$$

We will calculate the correlation coefficient of X and Y, and verify that it is indeed
equal to −1.

We have

$$\begin{aligned}
\mathrm{cov}(X, Y) &= E\big[(X - E[X])(Y - E[Y])\big] \\
&= -E\big[(X - E[X])^2\big] \\
&= -\mathrm{var}(X).
\end{aligned}$$

Hence, the correlation coefficient is

$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{-\mathrm{var}(X)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(X)}} = -1.$$
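The calculation of Example 4.25 can be reproduced exactly from the binomial PMF. In the sketch below, the values n = 5 and p = 1/3 and the function name are illustrative; since var(Y) = var(X), the square root in the definition of ρ equals var(X), so no floating-point arithmetic is needed.

```python
from fractions import Fraction as F
from math import comb

def heads_tails_moments(n, p):
    """Return (var(X), var(Y), cov(X, Y)) for X heads and Y = n - X tails in n tosses."""
    pmf = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}
    e_x = sum(k * q for k, q in pmf.items())
    e_y = sum((n - k) * q for k, q in pmf.items())
    var_x = sum((k - e_x) ** 2 * q for k, q in pmf.items())
    var_y = sum((n - k - e_y) ** 2 * q for k, q in pmf.items())
    cov = sum((k - e_x) * (n - k - e_y) * q for k, q in pmf.items())
    return var_x, var_y, cov

var_x, var_y, cov = heads_tails_moments(5, F(1, 3))
rho = cov / var_x      # equals cov / sqrt(var_x * var_y) because var_y == var_x
print(var_x, var_y, cov, rho)   # 10/9 10/9 -10/9 -1
```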


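The covariance can be used to obtain a formula for the variance of the
sum of several (not necessarily independent) random variables. In particular, if
X_1, X_2, ..., X_n are random variables with finite variance, we have

$$\mathrm{var}\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} \mathrm{var}(X_i) + 2\sum_{\substack{i,j=1 \\ i<j}}^{n} \mathrm{cov}(X_i, X_j).$$

This can be seen from the following calculation, where for brevity, we denote
X̃_i = X_i − E[X_i]:

$$\begin{aligned}
\mathrm{var}\Big(\sum_{i=1}^{n} X_i\Big) &= E\Big[\Big(\sum_{i=1}^{n} \tilde{X}_i\Big)^2\Big] \\
&= E\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} \tilde{X}_i \tilde{X}_j\Big] \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} E[\tilde{X}_i \tilde{X}_j] \\
&= \sum_{i=1}^{n} E[\tilde{X}_i^2] + 2\sum_{\substack{i,j=1 \\ i<j}}^{n} E[\tilde{X}_i \tilde{X}_j] \\
&= \sum_{i=1}^{n} \mathrm{var}(X_i) + 2\sum_{\substack{i,j=1 \\ i<j}}^{n} \mathrm{cov}(X_i, X_j).
\end{aligned}$$

The following example illustrates the use of this formula.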


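Example 4.26. Consider the hat problem discussed in Section 2.5, where n
people throw their hats in a box and then pick a hat at random. Let us find the
variance of X, the number of people that pick their own hat. We have

$$X = X_1 + \cdots + X_n,$$

where X_i is the random variable that takes the value 1 if the ith person selects
his/her own hat, and takes the value 0 otherwise. Noting that X_i is Bernoulli with
parameter p = P(X_i = 1) = 1/n, we obtain

$$\mathrm{var}(X_i) = \frac{1}{n}\Big(1 - \frac{1}{n}\Big).$$

For i ≠ j, we have

$$\begin{aligned}
\mathrm{cov}(X_i, X_j) &= E\big[(X_i - E[X_i])(X_j - E[X_j])\big] \\
&= E[X_i X_j] - E[X_i]\,E[X_j] \\
&= P(X_i = 1 \text{ and } X_j = 1) - P(X_i = 1)\,P(X_j = 1) \\
&= P(X_i = 1)\,P(X_j = 1 \mid X_i = 1) - P(X_i = 1)\,P(X_j = 1) \\
&= \frac{1}{n}\cdot\frac{1}{n-1} - \frac{1}{n^2} \\
&= \frac{1}{n^2(n-1)}.
\end{aligned}$$

Therefore

$$\begin{aligned}
\mathrm{var}(X) = \mathrm{var}\Big(\sum_{i=1}^{n} X_i\Big) &= \sum_{i=1}^{n} \mathrm{var}(X_i) + 2\sum_{\substack{i,j=1 \\ i<j}}^{n} \mathrm{cov}(X_i, X_j) \\
&= n\cdot\frac{1}{n}\Big(1 - \frac{1}{n}\Big) + 2\cdot\frac{n(n-1)}{2}\cdot\frac{1}{n^2(n-1)} \\
&= 1.
\end{aligned}$$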
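For small n, the result var(X) = 1 can be confirmed by brute force, enumerating all n! equally likely assignments of hats to people. The sketch below is only practical for small n; the function name own_hat_moments is illustrative.

```python
from fractions import Fraction as F
from itertools import permutations

def own_hat_moments(n):
    """Return (E[X], var(X)), where X counts the people who pick their own hat."""
    counts = [sum(1 for person, hat in enumerate(perm) if person == hat)
              for perm in permutations(range(n))]
    total = len(counts)                            # n! equally likely assignments
    mean = F(sum(counts), total)
    second_moment = F(sum(c * c for c in counts), total)
    return mean, second_moment - mean ** 2

for n in range(2, 7):
    mean, var = own_hat_moments(n)
    assert mean == 1 and var == 1                  # both equal 1 for every n >= 2
print("E[X] = var(X) = 1 for n = 2, ..., 6")
```

The case n = 1 is excluded because a single person always picks his/her own hat, so X is the constant 1 and its variance is zero.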


4.6 LEAST SQUARES ESTIMATION 


In many practical contexts, we want to form an estimate of the value of a random
variable X given the value of a related random variable Y, which may be viewed
as some form of “measurement” of X. For example, X may be the range of
an aircraft and Y may be a noise-corrupted measurement of that range. In
this section we discuss a popular formulation of the estimation problem, which
is based on finding the estimate c that minimizes the expected value of the
squared error (X − c)² (hence the name “least squares”).

If the value of Y is not available, we may consider finding an estimate (or
prediction) c of X. The estimation error X − c is random (because X is random),
but the mean squared error E[(X − c)²] is a number that depends on c and can
be minimized over c. With respect to this criterion, it turns out that the best
possible estimate is c = E[X], as we proceed to verify.

Let m = E[X]. For any estimate c, we have

$$\begin{aligned}
E\big[(X - c)^2\big] &= E\big[(X - m + m - c)^2\big] \\
&= E\big[(X - m)^2\big] + 2E[X - m]\,(m - c) + (m - c)^2 \\
&= E\big[(X - m)^2\big] + (m - c)^2,
\end{aligned}$$

where we used the fact E[X − m] = 0. The first term in the right-hand side
is the variance of X and is unaffected by our choice of c. Therefore, we should
choose c in a way that minimizes the second term, which leads to c = m = E[X]
(see Fig. 4.10).


Figure 4.10: The mean squared error E[(X − c)²], as a function of the estimate
c, is a quadratic in c and is minimized when c = E[X]. The minimum value of
the mean squared error is var(X).


Suppose now that we observe the experimental value y of some related
random variable Y, before forming an estimate of X. How can we exploit this
additional information? Once we are told that Y takes a particular value y, the
situation is identical to the one considered earlier, except that we are now in a
new “universe,” where everything is conditioned on Y = y. We can therefore
adapt our earlier conclusion and assert that c = E[X | Y = y] minimizes the
conditional mean squared error E[(X − c)² | Y = y]. Note that the resulting
estimate c depends on the experimental value y of Y (as it should). Thus, we
call E[X | Y = y] the least-squares estimate of X given the experimental value y.


Example 4.27. Let X be uniformly distributed in the interval [4, 10] and suppose
that we observe X with some random error W, that is, we observe the experimental
value of the random variable

$$Y = X + W.$$

We assume that W is uniformly distributed in the interval [−1, 1], and independent
of X. What is the least squares estimate of X given the experimental value of Y?

We have f_X(x) = 1/6 for 4 ≤ x ≤ 10, and f_X(x) = 0, elsewhere. Conditioned
on X being equal to some x, Y is the same as x + W, and is uniform over the
interval [x − 1, x + 1]. Thus, the joint PDF is given by

$$f_{X,Y}(x, y) = f_X(x)\,f_{Y|X}(y \mid x) = \frac{1}{6}\cdot\frac{1}{2} = \frac{1}{12},$$

if 4 ≤ x ≤ 10 and x − 1 ≤ y ≤ x + 1, and is zero for all other values of (x, y).
The slanted rectangle in the right-hand side of Fig. 4.11 is the set of pairs (x, y) for
which f_{X,Y}(x, y) is nonzero.

Given an experimental value y of Y, the conditional PDF f_{X|Y} of X is uniform
on the corresponding vertical section of the slanted rectangle. The optimal estimate
E[X | Y = y] is the midpoint of that section. In the special case of the present
example, it happens to be a piecewise linear function of y.


Figure 4.11: The PDFs in Example 4.27. The least squares estimate of X given
the experimental value y of the random variable Y = X + W depends on y and
is represented by the piecewise linear function shown in the figure on the right.
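The piecewise linear estimate of Example 4.27 can be written out explicitly. Given Y = y, X is uniform on the set of x with 4 ≤ x ≤ 10 and y − 1 ≤ x ≤ y + 1, and the estimate is the midpoint of that interval. The function name lse_estimate below is an illustrative choice.

```python
def lse_estimate(y):
    """Least squares estimate E[X | Y = y] for X ~ U[4, 10], Y = X + W, W ~ U[-1, 1]."""
    if not 3 <= y <= 11:
        raise ValueError("y must lie in [3, 11], the range of possible values of Y")
    lo = max(4.0, y - 1.0)     # smallest x compatible with the observation y
    hi = min(10.0, y + 1.0)    # largest x compatible with the observation y
    return (lo + hi) / 2       # midpoint of the conditional (uniform) distribution

for y in (3, 4, 7, 10, 11):
    print(y, lse_estimate(y))
```

The estimate equals y itself for 5 ≤ y ≤ 9, and is pulled toward the interior of [4, 10] near the two ends, where the constraint 4 ≤ x ≤ 10 becomes active: it is (y + 5)/2 for 3 ≤ y ≤ 5 and (y + 9)/2 for 9 ≤ y ≤ 11.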


As Example 4.27 illustrates, the estimate E[X | Y = y] depends on the
observed value y and should be viewed as a function of y; see Fig. 4.12. To
amplify this point, we refer to any function of the available information as an
estimator. Given an experimental outcome y of Y, an estimator g(·) (which is
a function) produces an estimate g(y) (which is a number). However, if y is left
unspecified, then the estimator results in a random variable g(Y). The expected
value of the squared estimation error associated with an estimator g(Y) is

$$E\big[(X - g(Y))^2\big].$$

Out of all estimators, it turns out that the mean squared estimation error
is minimized when g(Y) = E[X | Y]. To see this, note that if c is any number,
we have

$$E\big[(X - E[X \mid Y = y])^2 \mid Y = y\big] \le E\big[(X - c)^2 \mid Y = y\big].$$

Consider now an estimator g(Y). For a given value y of Y, g(y) is a number
and, therefore,

$$E\big[(X - E[X \mid Y = y])^2 \mid Y = y\big] \le E\big[(X - g(y))^2 \mid Y = y\big].$$

This inequality is true for every possible experimental value y of Y. Thus,

$$E\big[(X - E[X \mid Y])^2 \mid Y\big] \le E\big[(X - g(Y))^2 \mid Y\big],$$

which is now an inequality between random variables (functions of Y). We take
expectations of both sides, and use the law of iterated expectations, to conclude
that

$$E\big[(X - E[X \mid Y])^2\big] \le E\big[(X - g(Y))^2\big]$$

for all functions g(Y).


Figure 4.12: The least squares estimator.

Key Facts about Least Mean Squares Estimation

• E[(X − c)²] is minimized when c = E[X]:

E[(X − E[X])²] ≤ E[(X − c)²], for all c.

• E[(X − c)² | Y = y] is minimized when c = E[X | Y = y]:

E[(X − E[X | Y = y])² | Y = y] ≤ E[(X − c)² | Y = y], for all c.

• Out of all estimators g(Y) of X based on Y, the mean squared estimation
error E[(X − g(Y))²] is minimized when g(Y) = E[X | Y]:

E[(X − E[X | Y])²] ≤ E[(X − g(Y))²], for all functions g(Y).


Some Properties of the Estimation Error

Let us introduce the notation

$$\hat{X} = E[X \mid Y], \qquad \tilde{X} = X - \hat{X},$$

for the (optimal) estimator and the associated estimation error, respectively.
Note that both X̂ and X̃ are random variables, and by the law of iterated
expectations,

$$E[\tilde{X}] = E\big[X - E[X \mid Y]\big] = E[X] - E[X] = 0.$$

The equation E[X̃] = 0 remains valid even if we condition on Y, because

$$E[\tilde{X} \mid Y] = E[X - \hat{X} \mid Y] = E[X \mid Y] - E[\hat{X} \mid Y] = \hat{X} - \hat{X} = 0.$$

We have used here the fact that X̂ is completely determined by Y and therefore
E[X̂ | Y] = X̂. For similar reasons,

$$E\big[(\hat{X} - E[X])\,\tilde{X} \mid Y\big] = (\hat{X} - E[X])\,E[\tilde{X} \mid Y] = 0.$$

Taking expectations and using the law of iterated expectations, we obtain

$$E\big[(\hat{X} - E[X])\,\tilde{X}\big] = 0.$$

Note that X = X̂ + X̃, which yields X − E[X] = X̂ − E[X] + X̃. We square
both sides of the latter equality and take expectations to obtain

$$\begin{aligned}
\mathrm{var}(X) &= E\big[(X - E[X])^2\big] \\
&= E\big[(\hat{X} - E[X] + \tilde{X})^2\big] \\
&= E\big[(\hat{X} - E[X])^2\big] + E[\tilde{X}^2] + 2E\big[(\hat{X} - E[X])\,\tilde{X}\big] \\
&= E\big[(\hat{X} - E[X])^2\big] + E[\tilde{X}^2] \\
&= \mathrm{var}(\hat{X}) + \mathrm{var}(\tilde{X}).
\end{aligned}$$

(The last equality holds because E[X̂] = E[X] and E[X̃] = 0.) In summary, we
have established the following important formula, which is just another version
of the law of conditional variances introduced in Section 4.3.

$$\mathrm{var}(X) = \mathrm{var}(\hat{X}) + \mathrm{var}(\tilde{X}).$$


Example 4.28. Let us say that the observed random variable Y is uninformative if
the mean squared estimation error E[X̃²] = var(X̃) is the same as the unconditional
variance var(X) of X. When is this the case?

Using the formula

$$\mathrm{var}(X) = \mathrm{var}(\hat{X}) + \mathrm{var}(\tilde{X}),$$

we see that Y is uninformative if and only if var(X̂) = 0. The variance of a random
variable is zero if and only if that random variable is a constant, equal to its mean.
We conclude that Y is uninformative if and only if X̂ = E[X | Y] = E[X], for every
realization of Y.

If X and Y are independent, we have E[X | Y] = E[X] and Y is indeed
uninformative, which is quite intuitive. The converse, however, is not true. That
is, it is possible for E[X | Y] to be always equal to the constant E[X], without X
and Y being independent. (Can you construct an example?)


Estimation Based on Several Measurements 


So far, we have discussed the case where we estimate one random variable X
on the basis of another random variable Y. In practice, one often has access
to the experimental values of several random variables Y_1, ..., Y_n, that can be
used to estimate X. Generalizing our earlier discussion, and using essentially
the same argument, the mean squared estimation error is minimized if we use
E[X | Y_1, ..., Y_n] as our estimator. That is,

$$E\big[(X - E[X \mid Y_1, \ldots, Y_n])^2\big] \le E\big[(X - g(Y_1, \ldots, Y_n))^2\big],$$

for all functions g(Y_1, ..., Y_n).

This provides a complete solution to the general problem of least squares
estimation, but is sometimes difficult to implement, because:

(a) In order to compute the conditional expectation E[X | Y_1, ..., Y_n], we need
the joint PDF of X, Y_1, ..., Y_n, which involves n + 1 random variables.

(b) Even if this joint PDF is available, E[X | Y_1, ..., Y_n] can be a very complicated
function of Y_1, ..., Y_n.

As a consequence, practitioners often resort to approximations of the conditional
expectation or focus on estimators that are not optimal but are simple and easy
to implement. The most common approach involves linear estimators, of the
form

$$a_1 Y_1 + \cdots + a_n Y_n + b.$$

Given a particular choice of a_1, ..., a_n, b, the corresponding mean squared error
is

$$E\big[(X - a_1 Y_1 - \cdots - a_n Y_n - b)^2\big],$$

and it is meaningful to choose the coefficients a_1, ..., a_n, b in a way that minimizes
the above expression. This problem is relatively easy to solve and only
requires knowledge of the means, variances, and covariances of the different random
variables. We develop the solution for the case where n = 1.


Linear Least Mean Squares Estimation Based on a Single Measurement 


We are interested in finding a and b that minimize the mean squared estimation
error E[(X − aY − b)²], associated with a linear estimator aY + b of X. Suppose
that a has already been chosen. How should we choose b? This is the same as
having to choose a constant b to estimate the random variable X − aY and, by
our earlier results, the best choice is to let b = E[X − aY] = E[X] − aE[Y].

It now remains to minimize, with respect to a, the expression

$$E\big[(X - aY - E[X] + aE[Y])^2\big],$$

which is the same as

$$\begin{aligned}
E\big[\big((X - E[X]) - a(Y - E[Y])\big)^2\big]
&= E\big[(X - E[X])^2\big] + a^2 E\big[(Y - E[Y])^2\big] - 2a\,E\big[(X - E[X])(Y - E[Y])\big] \\
&= \sigma_X^2 + a^2\sigma_Y^2 - 2a\cdot\mathrm{cov}(X, Y),
\end{aligned}$$

where cov(X, Y) is the covariance of X and Y:

$$\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].$$

This is a quadratic function of a, which is minimized at the point where its
derivative is zero, that is, if

$$a = \frac{\mathrm{cov}(X, Y)}{\sigma_Y^2} = \frac{\rho\,\sigma_X\sigma_Y}{\sigma_Y^2} = \rho\,\frac{\sigma_X}{\sigma_Y},$$

where

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X\sigma_Y}$$

is the correlation coefficient. With this choice of a, the mean squared estimation
error is given by

$$\begin{aligned}
\sigma_X^2 + a^2\sigma_Y^2 - 2a\cdot\mathrm{cov}(X, Y)
&= \sigma_X^2 + \rho^2\frac{\sigma_X^2}{\sigma_Y^2}\,\sigma_Y^2 - 2\rho\,\frac{\sigma_X}{\sigma_Y}\,\rho\,\sigma_X\sigma_Y \\
&= (1 - \rho^2)\,\sigma_X^2.
\end{aligned}$$

Linear Least Mean Squares Estimation Formulas

The least mean squares linear estimator of X based on Y is

$$E[X] + \frac{\mathrm{cov}(X, Y)}{\sigma_Y^2}\,(Y - E[Y]).$$

The resulting mean squared estimation error is equal to

$$(1 - \rho^2)\,\mathrm{var}(X).$$
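As an illustration of these formulas, the sketch below computes the linear least mean squares estimator for the setting of Example 4.27 (X uniform on [4, 10], Y = X + W with W uniform on [−1, 1] and independent of X). The function name linear_lms is illustrative, and the moments are derived under the assumptions stated in the comments.

```python
from fractions import Fraction as F

def linear_lms(mean_x, mean_y, var_x, var_y, cov_xy):
    """Return (a, b, mse) for the estimator a*Y + b minimizing E[(X - aY - b)^2]."""
    a = cov_xy / var_y                       # a = cov(X, Y) / var(Y)
    b = mean_x - a * mean_y                  # b = E[X] - a E[Y]
    rho_sq = cov_xy ** 2 / (var_x * var_y)   # squared correlation coefficient
    mse = (1 - rho_sq) * var_x               # (1 - rho^2) var(X)
    return a, b, mse

var_x = F(36, 12)          # variance of a uniform on [4, 10]: (10 - 4)^2 / 12
var_w = F(4, 12)           # variance of a uniform on [-1, 1]: (1 - (-1))^2 / 12
mean_x = mean_y = F(7)     # E[W] = 0, so E[Y] = E[X] = 7
var_y = var_x + var_w      # X and W are independent
cov_xy = var_x             # cov(X, X + W) = var(X) + cov(X, W) = var(X)

a, b, mse = linear_lms(mean_x, mean_y, var_x, var_y, cov_xy)
print(a, b, mse)           # 9/10 7/10 3/10
```

The resulting linear estimator 0.9 Y + 0.7 differs from the piecewise linear estimator E[X | Y] of Example 4.27, which is optimal among all estimators; the linear one is optimal only within the class of linear estimators.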


4.7 THE BIVARIATE NORMAL DISTRIBUTION 
We say that two random variables X and Y have a bivariate normal distribution
if there are two independent normal random variables U and V and some scalars
a, b, c, d, such that

$$X = aU + bV, \qquad Y = cU + dV.$$

To keep the discussion simple, we restrict ourselves to the case where U, V (and
therefore, X and Y as well) have zero mean.

A most important property of the bivariate normal distribution is the fol- 
lowing: 


If two random variables X and Y have a bivariate normal distribution and 
are uncorrelated, then they are independent. 


This property can be verified using multivariate transforms. We assume
that X and Y have a bivariate normal distribution and are uncorrelated. Recall
that if Z is a zero-mean normal random variable with variance σ_Z², then
E[e^Z] = M_Z(1) = e^{σ_Z²/2}. Fix some scalars s_1, s_2 and let Z = s_1 X + s_2 Y. Then, Z is the
sum of the independent normal random variables (a s_1 + c s_2)U and (b s_1 + d s_2)V,
and is therefore normal. Since X and Y are uncorrelated, the variance of Z is
s_1²σ_X² + s_2²σ_Y². Then,

$$\begin{aligned}
M_{X,Y}(s_1, s_2) &= E\big[e^{s_1 X + s_2 Y}\big] \\
&= E[e^Z] \\
&= e^{(s_1^2\sigma_X^2 + s_2^2\sigma_Y^2)/2}.
\end{aligned}$$

Let X̄ and Ȳ be independent zero-mean normal random variables with the same
variances σ_X² and σ_Y² as X and Y. Since they are independent, they are uncorrelated,
and the same argument as above yields

$$M_{\bar{X},\bar{Y}}(s_1, s_2) = e^{(s_1^2\sigma_X^2 + s_2^2\sigma_Y^2)/2}.$$

Thus, the two pairs of random variables (X, Y) and (X̄, Ȳ) are associated with
the same multivariate transform. Since the multivariate transform completely
determines the joint PDF, it follows that the pair (X, Y) has the same joint
PDF as the pair (X̄, Ȳ). Since X̄ and Ȳ are independent, X and Y must also
be independent.

Let us define

$$\hat{X} = \frac{E[XY]}{E[Y^2]}\,Y, \qquad \tilde{X} = X - \hat{X}.$$

Thus, X̂ is the best linear estimator of X given Y, and X̃ is the estimation error.
Since X and Y are linear combinations of independent normal random variables
U and V, it follows that Y and X̃ are also linear combinations of U and V. In
particular, Y and X̃ have a bivariate normal distribution. Furthermore,

$$\mathrm{cov}(Y, \tilde{X}) = E[Y\tilde{X}] = E[YX] - E[Y\hat{X}] = E[YX] - \frac{E[XY]}{E[Y^2]}\,E[Y^2] = 0.$$

Thus, Y and X̃ are uncorrelated and, therefore, independent. Since X̂ is a scalar
multiple of Y, we also see that X̂ and X̃ are independent.

We now start from the identity

$$X = \hat{X} + \tilde{X},$$

which implies that

$$E[X \mid Y] = E[\hat{X} \mid Y] + E[\tilde{X} \mid Y].$$

But E[X̂ | Y] = X̂ because X̂ is completely determined by Y. Also, X̃ is independent
of Y and

$$E[\tilde{X} \mid Y] = E[\tilde{X}] = E[X - \hat{X}] = 0.$$

(The last equality was obtained because X and Y are assumed to have zero mean
and X̂ is a constant multiple of Y.) Putting everything together, we come to
the important conclusion that the best linear estimator X̂ is of the form

$$\hat{X} = E[X \mid Y].$$

Differently said, the optimal estimator E[X | Y] turns out to be linear.

Let us now determine the conditional density of X, conditioned on Y. We
have X = X̂ + X̃. After conditioning on Y, the value of the random variable
X̂ is completely determined. On the other hand, X̃ is independent of Y and its
distribution is not affected by conditioning. Therefore, the conditional distribution
of X given Y is the same as the distribution of X̃, shifted by X̂. Since X̃ is
normal with mean zero and some variance σ²_X̃, we conclude that the conditional
distribution of X is also normal with mean X̂ and variance σ²_X̃.

We summarize our conclusions below. Although our discussion used the
zero-mean assumption, these conclusions also hold for the non-zero mean case
and we state them with this added generality.


Properties of the Bivariate Normal Distribution

Let X and Y have a bivariate normal distribution. Then:

• X and Y are independent if and only if they are uncorrelated.

• The conditional expectation is given by

$$E[X \mid Y] = E[X] + \frac{\mathrm{cov}(X, Y)}{\sigma_Y^2}\,(Y - E[Y]).$$

It is a linear function of Y and has a normal distribution.

• The conditional distribution of X given Y is normal with mean E[X | Y]
and variance

$$\sigma_{\tilde{X}}^2 = (1 - \rho^2)\,\sigma_X^2.$$

Finally, let us note that while if X and Y have a bivariate normal distri- 
bution, then X and Y are (individually) normal random variables, the reverse 
is not true even if X and Y are uncorrelated. This is illustrated in the following 
example. 


Example 4.29. Let X have a normal distribution with zero mean and unit
variance. Let Z be independent of X, with P(Z = 1) = P(Z = −1) = 1/2. Let
Y = ZX, which is also normal with zero mean (why?). Furthermore,

$$E[XY] = E[ZX^2] = E[Z]\,E[X^2] = 0 \cdot 1 = 0,$$

so X and Y are uncorrelated. On the other hand X and Y are clearly dependent.
(For example, if X = 1, then Y must be either −1 or 1.) This may seem to contradict
our earlier conclusion that zero correlation implies independence. However, in this
example, the joint distribution of X and Y is not bivariate normal, even though both
marginal distributions are normal.


Stochastic Processes 




A stochastic process is a mathematical model of a probabilistic experiment that
evolves in time and generates a sequence of numerical values. For example, a
stochastic process can be used to model:

(a) the sequence of daily prices of a stock;

(b) the sequence of scores in a football game;

(c) the sequence of failure times of a machine;

(d) the sequence of hourly traffic loads at a node of a communication network;

(e) the sequence of radar measurements of the position of an airplane.

Each numerical value in the sequence is modeled by a random variable, so a
stochastic process is simply a (finite or infinite) sequence of random variables
and does not represent a major conceptual departure from our basic framework.
We are still dealing with a single basic experiment that involves outcomes governed
by a probability law, and random variables that inherit their probabilistic
properties from that law.† However, stochastic processes involve some change in
emphasis over our earlier models. In particular:

(a) We tend to focus on the dependencies in the sequence of values generated
by the process. For example, how do future prices of a stock depend on
past values?

(b) We are often interested in long-term averages, involving the entire sequence
of generated values. For example, what is the fraction of time that
a machine is idle?

(c) We sometimes wish to characterize the likelihood or frequency of certain
boundary events. For example, what is the probability that within a
given hour all circuits of some telephone system become simultaneously
busy, or what is the frequency with which some buffer in a computer network
overflows with data?

In this book, we will discuss two major categories of stochastic processes.

(a) Arrival-Type Processes: Here, we are interested in occurrences that have
the character of an “arrival,” such as message receptions at a receiver, job
completions in a manufacturing cell, customer purchases at a store, etc.
We will focus on models in which the interarrival times (the times between
successive arrivals) are independent random variables. In Section 5.1, we
consider the case where arrivals occur in discrete time and the interarrival
times are geometrically distributed; this is the Bernoulli process. In Section 5.2,
we consider the case where arrivals occur in continuous time and
the interarrival times are exponentially distributed; this is the Poisson
process.

(b) Markov Processes: Here, we are looking at experiments that evolve in time
and in which the future evolution exhibits a probabilistic dependence on
the past. As an example, the future daily prices of a stock are typically
dependent on past prices. However, in a Markov process, we assume a very
special type of dependence: the next value depends on past values only
through the current value. There is a rich methodology that applies to
such processes, and which will be developed in Chapter 6.

† Let us emphasize that all of the random variables arising in a stochastic process
refer to a single and common experiment, and are therefore defined on a common
sample space. The corresponding probability law can be specified directly or indirectly
(by assuming some of its properties), as long as it unambiguously determines the joint
CDF of any subset of the random variables involved.


5.1 THE BERNOULLI PROCESS 


The Bernoulli process can be visualized as a sequence of independent coin tosses, 
where the probability of heads in each toss is a fixed number p in the range 
0 < p < 1. In general, the Bernoulli process consists of a sequence of Bernoulli
trials, where each trial produces a 1 (a success) with probability p, and a 0 (a
failure) with probability 1 - p, independently of what happens in other trials.

Of course, coin tossing is just a paradigm for a broad range of contexts 
involving a sequence of independent binary outcomes. For example, a Bernoulli 
process is often used to model systems involving arrivals of customers or jobs at 
service centers. Here, time is discretized into periods, and a “success” at the kth 
trial is associated with the arrival of at least one customer at the service center 
during the kth period. In fact, we will often use the term “arrival” in place of 
“success” when this is justified by the context. 

In a more formal description, we define the Bernoulli process as a sequence 
$X_1, X_2, \ldots$ of independent Bernoulli random variables $X_i$ with

$$P(X_i = 1) = P(\text{success at the $i$th trial}) = p,$$
$$P(X_i = 0) = P(\text{failure at the $i$th trial}) = 1 - p,$$

for each $i$.†

Given an arrival process, one is often interested in random variables such 
as the number of arrivals within a certain time period, or the time until the first 
arrival. For the case of a Bernoulli process, some answers are already available 
from earlier chapters. Here is a summary of the main facts. 


† Generalizing from the case of a finite number of random variables, the
independence of an infinite sequence of random variables $X_i$ is defined by the
requirement that the random variables $X_1, \ldots, X_n$ be independent for any finite
$n$. Intuitively, knowing the experimental values of any finite subset of the random
variables does not provide any new probabilistic information on the remaining random
variables, and the conditional distribution of the latter stays the same as the
unconditional one.

Some Random Variables Associated with the Bernoulli Process
and their Properties


• The binomial with parameters p and n. This is the number S of
successes in n independent trials. Its PMF, mean, and variance are

$$p_S(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

$$E[S] = np, \qquad \mathrm{var}(S) = np(1 - p).$$


• The geometric with parameter p. This is the number T of trials
up to (and including) the first success. Its PMF, mean, and variance
are

$$p_T(t) = (1 - p)^{t-1} p, \quad t = 1, 2, \ldots, \qquad E[T] = \frac{1}{p}, \qquad \mathrm{var}(T) = \frac{1 - p}{p^2}.$$
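
As a quick numerical check of the summary above, the following Python sketch evaluates the binomial and geometric PMFs and recovers their means and variances by direct summation. The function names, the values n = 20 and p = 0.3, and the truncation of the geometric sum at 400 terms are assumptions made for illustration only; they do not come from the text.

```python
import math

def binomial_pmf(k, n, p):
    """P(S = k) for the number S of successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(t, p):
    """P(T = t) for the number T of trials up to and including the first success."""
    return (1 - p)**(t - 1) * p

def mean_and_variance(pmf):
    """Mean and variance of a PMF given as a dict {value: probability}."""
    mean = sum(x * px for x, px in pmf.items())
    var = sum((x - mean)**2 * px for x, px in pmf.items())
    return mean, var

# Illustrative parameters (assumed for this sketch).
n, p = 20, 0.3
binomial = {k: binomial_pmf(k, n, p) for k in range(n + 1)}
# The geometric PMF has infinite support; 400 terms leave a negligible tail.
geometric = {t: geometric_pmf(t, p) for t in range(1, 401)}

print(mean_and_variance(binomial))   # compare with np and np(1 - p)
print(mean_and_variance(geometric))  # compare with 1/p and (1 - p)/p^2
```

Summing the PMF directly, rather than simulating, keeps the check deterministic; the only approximation is the truncated geometric tail.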


Independence and Memorylessness 


The independence assumption underlying the Bernoulli process has important 
implications, including a memorylessness property (whatever has happened in 
past trials provides no information on the outcomes of future trials). An appreci- 
ation and intuitive understanding of such properties is very useful, and allows for 
the quick solution of many problems that would be difficult with a more formal 
approach. In this subsection, we aim at developing the necessary intuition. 

Let us start by considering random variables that are defined in terms of 
what happened in a certain set of trials. For example, the random variable 
Z = (X1 + X3)X6X7 is defined in terms of the first, third, sixth, and seventh 
trial. If we have two random variables of this type and if the two sets of trials 
that define them have no common element, then these random variables are 
independent. This is a generalization of a fact first seen in Chapter 2: if two 
random variables U and V are independent, then any two functions of them, 
g(U) and h(V), are also independent. 


Example 5.1. 


(a) Let U be the number of successes in trials 1 to 5. Let V be the number of 
successes in trials 6 to 10. Then, U and V are independent. This is because 
$U = X_1 + \cdots + X_5$, $V = X_6 + \cdots + X_{10}$, and the two collections
$\{X_1, \ldots, X_5\}$, $\{X_6, \ldots, X_{10}\}$ have no common elements.

(b) Let U (respectively, V) be the first odd (respectively, even) time i in which we
have a success. Then, U is determined by the odd-time sequence $X_1, X_3, \ldots$,
whereas V is determined by the even-time sequence $X_2, X_4, \ldots$. Since these
two sequences have no common elements, U and V are independent.


Suppose now that a Bernoulli process has been running for n time steps, 
and that we have observed the experimental values of $X_1, X_2, \ldots, X_n$. We
notice that the future trials $X_{n+1}, X_{n+2}, \ldots$ are independent Bernoulli
trials and therefore form a Bernoulli process. In addition, these future trials are
independent from the past ones. We conclude that starting from any given point 
in time, the future is also modeled by a Bernoulli process, which is independent 
of the past. We refer to this as the fresh-start property of the Bernoulli process. 

Let us now recall that the time T until the first success is a geometric 
random variable. Suppose that we have been watching the process for n time 
steps and no success has been recorded. What can we say about the number $T - n$
of remaining trials until the first success? Since the future of the process (after
time n) is independent of the past and constitutes a fresh-starting Bernoulli
process, the number of future trials until the first success is described by the
same geometric PMF. Mathematically, we have


$$P(T - n = t \mid T > n) = (1 - p)^{t-1} p = P(T = t), \qquad t = 1, 2, \ldots$$


This memorylessness property can also be derived algebraically, using the 
definition of conditional probabilities, but the argument given here is certainly 
more intuitive. 
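
The algebraic route mentioned above can be checked numerically. The sketch below computes the conditional probability directly from its definition, P(T = n + t)/P(T > n), and compares it with the unconditional geometric PMF. The function names and the value p = 0.25 are assumptions chosen for illustration.

```python
def geometric_pmf(t, p):
    """P(T = t): first success occurs at trial t."""
    return (1 - p)**(t - 1) * p

def prob_no_success_in_first(n, p):
    """P(T > n): the first n trials are all failures."""
    return (1 - p)**n

def conditional_remaining_pmf(t, n, p):
    """P(T - n = t | T > n), using the definition of conditional probability.

    The event {T = n + t} is contained in {T > n} for t >= 1, so the
    numerator is simply P(T = n + t).
    """
    return geometric_pmf(n + t, p) / prob_no_success_in_first(n, p)

p = 0.25  # assumed success probability
for n in (1, 5, 20):
    for t in (1, 2, 7):
        print(n, t, conditional_remaining_pmf(t, n, p), geometric_pmf(t, p))
```

Each printed row should show two matching probabilities (up to floating-point rounding), whatever the number n of failures already observed.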


Memorylessness and the Fresh-Start Property of the Bernoulli 
Process 


• The number $T - n$ of trials until the first success after time n has
a geometric distribution with parameter p, and is independent of the
past.


• For any given time n, the sequence of random variables $X_{n+1}, X_{n+2}, \ldots$
(the future of the process) is also a Bernoulli process, and is independent
from $X_1, \ldots, X_n$ (the past of the process).


The next example deals with an extension of the fresh-start property, in 
which we start looking at the process at a random time, determined by the past 
history of the process. 


Example 5.2. Let N be the first time in which we have a success immediately
following a previous success. (That is, N is the first i for which $X_{i-1} = X_i = 1$.)
What is the probability $P(X_{N+1} = X_{N+2} = 0)$ that there are no successes in the
two trials that follow?




Intuitively, once the condition $X_{N-1} = X_N = 1$ is satisfied, from then on,
the future of the process still consists of independent Bernoulli trials. Therefore the
probability of an event that refers to the future of the process is the same as in a
fresh-starting Bernoulli process, so that $P(X_{N+1} = X_{N+2} = 0) = (1 - p)^2$.

To make this argument precise, we argue that the time N is a random variable, 
and by conditioning on the possible values of N, we have 


$$\begin{aligned}
P(X_{N+1} = X_{N+2} = 0) &= \sum_{n} P(N = n)\, P(X_{N+1} = X_{N+2} = 0 \mid N = n) \\
&= \sum_{n} P(N = n)\, P(X_{n+1} = X_{n+2} = 0 \mid N = n).
\end{aligned}$$


Because of the way that N was defined, the event $\{N = n\}$ occurs if and only if
the experimental values of $X_1, \ldots, X_n$ satisfy a certain condition. But the latter
random variables are independent of $X_{n+1}$ and $X_{n+2}$. Therefore,


$$P(X_{n+1} = X_{n+2} = 0 \mid N = n) = P(X_{n+1} = X_{n+2} = 0) = (1 - p)^2,$$


which leads to


$$P(X_{N+1} = X_{N+2} = 0) = \sum_{n} P(N = n)(1 - p)^2 = (1 - p)^2.$$
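
The conclusion of Example 5.2 can also be checked by simulation. The following Python sketch runs the process until two consecutive successes occur, then records the next two trials. It is only a Monte Carlo illustration under assumed values (p = 0.5, 20000 runs, a fixed seed), and the function names are hypothetical.

```python
import random

def bernoulli_trial(p, rng):
    """One Bernoulli(p) trial: 1 for success, 0 for failure."""
    return 1 if rng.random() < p else 0

def two_trials_after_first_double_success(p, rng):
    """Run trials until time N (first success right after a success),
    then return the outcomes of trials N + 1 and N + 2."""
    previous = 0  # there is no trial before the first one
    while True:
        current = bernoulli_trial(p, rng)
        if previous == 1 and current == 1:
            break  # time N has been reached
        previous = current
    return bernoulli_trial(p, rng), bernoulli_trial(p, rng)

def estimate_prob_two_failures(p, runs, seed=0):
    """Fraction of runs in which both trials after time N are failures."""
    rng = random.Random(seed)
    hits = sum(
        two_trials_after_first_double_success(p, rng) == (0, 0)
        for _ in range(runs)
    )
    return hits / runs

p = 0.5  # assumed for illustration
print(estimate_prob_two_failures(p, runs=20000), (1 - p)**2)
```

The estimate should be close to (1 - p)^2, even though the starting time N is itself random and determined by the past of the process.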


Interarrival Times 


An important random variable associated with the Bernoulli process is the time
of the kth success, which we denote by $Y_k$. A related random variable is the kth
interarrival time, denoted by $T_k$. It is defined by


$$T_1 = Y_1, \qquad T_k = Y_k - Y_{k-1}, \quad k = 2, 3, \ldots$$


and represents the number of trials following the (k - 1)st success until the next
success. See Fig. 5.1 for an illustration, and also note that


$$Y_k = T_1 + T_2 + \cdots + T_k.$$


[Figure 5.1 shows the trial sequence 0 0 1 0 0 0 0 1 0 1 1 0 0 along a time axis,
with the interarrival times $T_1$, $T_2$, $T_3$, $T_4$ marked between successive successes.]

Figure 5.1: Illustration of interarrival times. In this example, $T_1 = 3$, $T_2 = 5$,
$T_3 = 2$, $T_4 = 1$. Furthermore, $Y_1 = 3$, $Y_2 = 8$, $Y_3 = 10$, $Y_4 = 11$.

We have already seen that the time $T_1$ until the first success is a geometric
random variable with parameter p. Having had a success at time $T_1$, the future
is a fresh-starting Bernoulli process. Thus, the number of trials $T_2$ until the
next success has the same geometric PMF. Furthermore, past trials (up to and
including time $T_1$) are independent of future trials (from time $T_1 + 1$ onward).
Since $T_2$ is determined exclusively by what happens in these future trials, we
see that $T_2$ is independent of $T_1$. Continuing similarly, we conclude that the
random variables $T_1, T_2, T_3, \ldots$ are independent and all have the same geometric
distribution.

This important observation leads to an alternative, but equivalent way of 
describing the Bernoulli process, which is sometimes more convenient to work 
with. 


Alternative Description of the Bernoulli Process 


1. Start with a sequence of independent geometric random variables $T_1$,
$T_2, \ldots$, with common parameter p, and let these stand for the
interarrival times.


2. Record a success (or arrival) at times $T_1$, $T_1 + T_2$, $T_1 + T_2 + T_3$, etc.
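
The two steps above translate directly into code. The sketch below is one possible rendering in Python: it draws geometric interarrival times, accumulates them into arrival times, and converts the result back into a 0/1 trial sequence. The helper names, the value p = 0.3, the seed, and the horizon of 20 slots are assumptions made for illustration.

```python
import random
from itertools import accumulate

def geometric_sample(p, rng):
    """Draw one geometric(p) value by counting trials up to the first success."""
    t = 1
    while rng.random() >= p:
        t += 1
    return t

def arrival_times(interarrival_times):
    """Step 2: arrivals occur at T1, T1 + T2, T1 + T2 + T3, ..."""
    return list(accumulate(interarrival_times))

def trial_sequence(arrivals, horizon):
    """0/1 trial outcomes for slots 1..horizon implied by the arrival times."""
    arrival_set = set(arrivals)
    return [1 if t in arrival_set else 0 for t in range(1, horizon + 1)]

# The interarrival times of Fig. 5.1 reproduce its arrival times.
print(arrival_times([3, 5, 2, 1]))

# Step 1 with randomly drawn interarrival times (p and the seed are assumed).
rng = random.Random(0)
sampled = [geometric_sample(0.3, rng) for _ in range(5)]
print(trial_sequence(arrival_times(sampled), horizon=20))
```

Arrivals beyond the chosen horizon are simply not shown in the trial sequence; a longer horizon or more interarrival samples would extend it.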


Example 5.3. A computer executes two types of tasks, priority and nonpriority, 
and operates in discrete time units (slots). A priority task arises with probability 
p at the beginning of each slot, independently of other slots, and requires one full 
slot to complete. A nonpriority task is executed at a given slot only if no priority 
task is available. In this context, it may be important to know the probabilistic 
properties of the time intervals available for nonpriority tasks. 

With this in mind, let us call a slot busy if within this slot, the computer 
executes a priority task, and otherwise let us call it idle. We call a string of idle 
(or busy) slots, flanked by busy (or idle, respectively) slots, an idle period (or busy 
period, respectively). Let us derive the PMF, mean, and variance of the following 
random variables (cf. Fig. 5.2): 


(a) T = the time index of the first idle slot; 
(b) B = the length (number of slots) of the first busy period; 
(c) I = the length of the first idle period. 


We recognize T as a geometrically distributed random variable with parameter
1 - p. Its PMF is

$$p_T(k) = p^{k-1}(1 - p), \qquad k = 1, 2, \ldots$$

Its mean and variance are

$$E[T] = \frac{1}{1 - p}, \qquad \mathrm{var}(T) = \frac{p}{(1 - p)^2}.$$

Figure 5.2: Illustration of busy (B) and idle (I) periods in Example 5.3. In
the top diagram, T = 4, B = 3, and I = 2. In the bottom diagram, T = 1,
I = 5, and B = 4.


Let us now consider the first busy period. It starts with the first busy slot, 
call it slot L. (In the top diagram in Fig. 5.2, L = 1; in the bottom diagram, L = 6.) 
The number Z of subsequent slots until (and including) the first subsequent idle 
slot has the same distribution as T, because the Bernoulli process starts fresh at 
time L + 1. We then notice that Z = B and conclude that B has the same PMF 
as T. 

If we reverse the roles of idle and busy slots, and interchange p with 1 - p, we
see that the length I of the first idle period has the same PMF as the time index
of the first busy slot, so that


$$p_I(k) = (1 - p)^{k-1} p, \quad k = 1, 2, \ldots, \qquad E[I] = \frac{1}{p}, \qquad \mathrm{var}(I) = \frac{1 - p}{p^2}.$$


We finally note that the argument given here also works for the second, third, 
etc. busy (or idle) period. Thus the PMFs calculated above apply to the ith busy 
and idle period, for any i. 


The kth Arrival Time 


The time $Y_k$ of the kth success is equal to the sum $Y_k = T_1 + T_2 + \cdots + T_k$ of k
independent identically distributed geometric random variables. This allows us
to derive formulas for the mean, variance, and PMF of $Y_k$, which are given in
the table that follows.

Properties of the kth Arrival Time


• The kth arrival time is equal to the sum of the first k interarrival times

$$Y_k = T_1 + T_2 + \cdots + T_k,$$

and the latter are independent geometric random variables with common
parameter p.


• The mean and variance of $Y_k$ are given by

$$E[Y_k] = E[T_1] + \cdots + E[T_k] = \frac{k}{p},$$

$$\mathrm{var}(Y_k) = \mathrm{var}(T_1) + \cdots + \mathrm{var}(T_k) = \frac{k(1 - p)}{p^2}.$$


• The PMF of $Y_k$ is given by

$$p_{Y_k}(t) = \binom{t-1}{k-1} p^k (1 - p)^{t-k}, \qquad t = k, k+1, \ldots,$$

and is known as the Pascal PMF of order k.
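
The properties listed above can be checked numerically. The Python sketch below evaluates the Pascal PMF of order k and confirms by direct summation that it adds up to 1 and has mean k/p and variance k(1 - p)/p^2. The function name, the choice k = 3 and p = 0.4, and the truncation of the sums at t = 399 are assumptions made for illustration.

```python
import math

def pascal_pmf(t, k, p):
    """P(Y_k = t): the kth success occurs at trial t (zero when t < k)."""
    if t < k:
        return 0.0
    return math.comb(t - 1, k - 1) * p**k * (1 - p)**(t - k)

# Illustrative parameters (assumed for this sketch).
k, p = 3, 0.4
support = range(k, 400)  # truncated; the neglected tail is negligible here

total = sum(pascal_pmf(t, k, p) for t in support)
mean = sum(t * pascal_pmf(t, k, p) for t in support)
var = sum((t - mean)**2 * pascal_pmf(t, k, p) for t in support)

print(total, mean, var)  # compare with 1, k/p, and k(1 - p)/p^2
```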


To verify the formula for the PMF of $Y_k$, we first note that $Y_k$ cannot be
smaller than k. For $t \geq k$, we observe that the event $\{Y_k = t\}$ (the kth success
comes at time t) will occur if and only if both of the following two events A and
B occur:


(a) event A: trial t is a success;
(b) event B: exactly k - 1 successes occur in the first t - 1 trials.


The probabilities of these two events are

$$P(A) = p$$

and

$$P(B) = \binom{t-1}{k-1} p^{k-1} (1 - p)^{t-k},$$

respectively. In addition, these two events are independent (whether trial t is a
success or not is independent of what happened in the first t - 1 trials). Therefore,


$$p_{Y_k}(t) = P(Y_k = t) = P(A \cap B) = P(A)P(B) = \binom{t-1}{k-1} p^k (1 - p)^{t-k},$$


as claimed.

Example 5.4. In each minute of basketball play, Alice commits a single foul with
probability p and no foul with probability 1 - p. The number of fouls in different
minutes are assumed to be independent. Alice will foul out of the game once she
commits her sixth foul, and will play 30 minutes if she does not foul out. What is
the PMF of Alice’s playing time?

We model fouls as a Bernoulli process with parameter p. Alice’s playing time
Z is equal to $Y_6$, the time until the sixth foul, except if $Y_6$ is larger than 30, in which
case, her playing time is 30, the duration of the game; that is, $Z = \min\{Y_6, 30\}$.
The random variable $Y_6$ has a Pascal PMF of order 6, which is given by


$$p_{Y_6}(t) = \binom{t-1}{5} p^6 (1 - p)^{t-6}, \qquad t = 6, 7, \ldots$$


To determine the PMF $p_Z(z)$ of Z, we first consider the case where z is between 6
and 29. For z in this range, we have


$$p_Z(z) = P(Z = z) = P(Y_6 = z) = \binom{z-1}{5} p^6 (1 - p)^{z-6}, \qquad z = 6, 7, \ldots, 29.$$


The probability that Z = 30 is then determined from


$$p_Z(30) = 1 - \sum_{z=6}^{29} p_Z(z).$$
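
To make Example 5.4 concrete, the sketch below tabulates the PMF of the playing time Z for one assumed value of the foul probability, p = 0.2 (the example itself leaves p unspecified). The function names and default arguments are hypothetical. As a cross-check, Z = 30 happens exactly when at most five fouls occur in the first 29 minutes, which is a binomial probability.

```python
import math

def pascal_pmf(t, k, p):
    """P(Y_k = t): the kth foul occurs in minute t."""
    if t < k:
        return 0.0
    return math.comb(t - 1, k - 1) * p**k * (1 - p)**(t - k)

def playing_time_pmf(p, foul_limit=6, game_length=30):
    """PMF of Z = min{Y_6, 30} as a dict {minutes played: probability}."""
    pmf = {z: pascal_pmf(z, foul_limit, p) for z in range(foul_limit, game_length)}
    # All remaining probability mass sits at the full game length.
    pmf[game_length] = 1.0 - sum(pmf.values())
    return pmf

p = 0.2  # assumed foul probability per minute
pmf = playing_time_pmf(p)

# Cross-check: Z = 30 if and only if at most 5 fouls occur in minutes 1..29.
at_most_five_fouls = sum(
    math.comb(29, j) * p**j * (1 - p)**(29 - j) for j in range(6)
)
print(pmf[30], at_most_five_fouls)
```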


Splitting and Merging of Bernoulli Processes 


Starting with a Bernoulli process in which there is a probability p of an arrival
at each time, consider splitting it as follows. Whenever there is an arrival, we
choose to either keep it (with probability q), or to discard it (with probability
1 - q); see Fig. 5.3. Assume that the decisions to keep or discard are independent
for different arrivals. If we focus on the process of arrivals that are kept, we see
that it is a Bernoulli process: in each time slot, there is a probability pq of a
kept arrival, independently of what happens in other slots. For the same reason,
the process of discarded arrivals is also a Bernoulli process, with a probability
of a discarded arrival at each time slot equal to p(1 - q).

In a reverse situation, we start with two independent Bernoulli processes
(with parameters p and q, respectively) and merge them into a single process,
as follows. An arrival is recorded in the merged process if and only if there
is an arrival in at least one of the two original processes, which happens with
probability p + q - pq [one minus the probability (1 - p)(1 - q) of no arrival in
either process]. Since different time slots in either of the original processes are
independent, different slots in the merged process are also independent. Thus,
the merged process is Bernoulli, with success probability p + q - pq at each time
step; see Fig. 5.4.

Figure 5.3: Splitting of a Bernoulli process.

Figure 5.4: Merging of independent Bernoulli processes.


Splitting and merging of Bernoulli (or other) arrival processes arise in
many contexts. For example, a two-machine work center may see a stream of 
arriving parts to be processed and split them by sending each part to a randomly 
chosen machine. Conversely, a machine may be faced with arrivals of different 
types that can be merged into a single arrival stream. 
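
Splitting and merging are easy to try out in simulation. The Python sketch below splits a simulated Bernoulli process by independent keep/discard decisions and merges two independent processes, then compares the empirical arrival frequencies with pq, p(1 - q), and p + q - pq. The function names, the parameter values, the run length, and the seed are assumptions made for illustration.

```python
import random

def bernoulli_process(p, n, rng):
    """n independent Bernoulli(p) trials as a list of 0/1 values."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def split(process, q, rng):
    """Keep each arrival with probability q, independently; discard otherwise."""
    kept, discarded = [], []
    for x in process:
        keep = x == 1 and rng.random() < q
        kept.append(1 if keep else 0)
        discarded.append(1 if (x == 1 and not keep) else 0)
    return kept, discarded

def merge(first, second):
    """Record an arrival whenever at least one of the two processes has one."""
    return [1 if (a or b) else 0 for a, b in zip(first, second)]

rng = random.Random(0)          # fixed seed, assumed for reproducibility
n, p, q = 100_000, 0.3, 0.5     # illustrative values

original = bernoulli_process(p, n, rng)
kept, discarded = split(original, q, rng)
merged = merge(bernoulli_process(p, n, rng), bernoulli_process(q, n, rng))

print(sum(kept) / n, p * q)
print(sum(discarded) / n, p * (1 - q))
print(sum(merged) / n, p + q - p * q)
```

Matching frequencies do not by themselves establish independence across slots; that part rests on the argument given in the text.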


The Poisson Approximation to the Binomial 


The number of successes in n independent Bernoulli trials is a binomial random 
variable with parameters n and p, and its mean is np. In this subsection, we 
concentrate on the special case where n is large but p is small, so that the mean 
np has a moderate value. A situation of this type arises when one passes from 
discrete to continuous time, a theme to be picked up in the next section. For
some more examples, think of the number of airplane accidents on any given day:
there is a large number of trials (airplane flights), but each one has a very small
probability of being involved in an accident. Or think of counting the number of 
typos in a book: there is a large number n of words, but a very small probability 
of misspelling each one. 

Mathematically, we can address situations of this kind by letting n grow
while simultaneously decreasing p, in a manner that keeps the product np at a
constant value $\lambda$. In the limit, it turns out that the formula for the binomial PMF
simplifies to the Poisson PMF. A precise statement is provided next, together 
with a reminder of some of the properties of the Poisson PMF that were derived 
in earlier chapters. 


Poisson Approximation to the Binomial 


• A Poisson random variable Z with parameter $\lambda$ takes nonnegative
integer values and is described by the PMF

$$p_Z(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

Its mean and variance are

$$E[Z] = \lambda, \qquad \mathrm{var}(Z) = \lambda.$$


• For any fixed nonnegative integer k, the binomial probability

$$p_S(k) = \frac{n!}{(n - k)!\, k!}\, p^k (1 - p)^{n-k}$$

converges to $p_Z(k)$, when we take the limit as $n \to \infty$ and $p = \lambda/n$,
while keeping $\lambda$ constant.


• In general, the Poisson PMF is a good approximation to the binomial
as long as $\lambda = np$, n is very large, and p is very small.


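The verification of the limiting behavior of the binomial probabilities was
given in Chapter 2 as an end-of-chapter problem, and is replicated here for
convenience. We let $p = \lambda/n$ and note that

$$\begin{aligned}
p_S(k) &= \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k} \\
&= \frac{n(n-1)\cdots(n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}.
\end{aligned}$$

Let us focus on a fixed k and let $n \to \infty$. Each one of the ratios $(n-1)/n$,
$(n-2)/n, \ldots, (n-k+1)/n$ converges to 1. Furthermore,†

$$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}.$$

We conclude that for each fixed k, and as $n \to \infty$, we have

$$p_S(k) \to e^{-\lambda} \frac{\lambda^k}{k!}.$$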

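Example 5.5. As a rule of thumb, the Poisson/binomial approximation

$$e^{-\lambda} \frac{\lambda^k}{k!} \approx \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

is valid to several decimal places if $n \geq 100$, $p \leq 0.01$, and $\lambda = np$. To check this,
consider the following.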

Garry Kasparov, the world chess champion (as of 1999), plays against 100
amateurs in a large simultaneous exhibition. It has been estimated from past experience
that Kasparov wins in such exhibitions 99% of his games on the average (in precise
probabilistic terms, we assume that he wins each game with probability 0.99,
independently of other games). What are the probabilities that he will win 100 games,
98 games, 95 games, and 90 games?

We model the number of games X that Kasparov does not win as a binomial
random variable with parameters n = 100 and p = 0.01. Thus the probabilities
that he will win 100 games, 98 games, 95 games, and 90 games are

$$p_X(0) = (1 - 0.01)^{100} = 0.366,$$
$$p_X(2) = \frac{100!}{98!\,2!}\, 0.01^2 (1 - 0.01)^{98} = 0.185,$$
$$p_X(5) = \frac{100!}{95!\,5!}\, 0.01^5 (1 - 0.01)^{95} = 0.00290,$$
$$p_X(10) = \frac{100!}{90!\,10!}\, 0.01^{10} (1 - 0.01)^{90} = 7.006 \times 10^{-8},$$

† We are using here the well known formula $\lim_{x \to \infty} (1 - \frac{1}{x})^x = e^{-1}$. Letting
$x = n/\lambda$, we have $\lim_{n \to \infty} (1 - \frac{\lambda}{n})^{n/\lambda} = e^{-1}$, from which it follows that
$\lim_{n \to \infty} (1 - \frac{\lambda}{n})^n = e^{-\lambda}$.

respectively. Now let us check the corresponding Poisson approximations with
$\lambda = 100 \cdot 0.01 = 1$. They are:

$$p_Z(0) = e^{-1} \frac{1}{0!} = 0.368,$$
$$p_Z(2) = e^{-1} \frac{1}{2!} = 0.184,$$
$$p_Z(5) = e^{-1} \frac{1}{5!} = 0.00307,$$
$$p_Z(10) = e^{-1} \frac{1}{10!} = 1.014 \times 10^{-7}.$$

By comparing the binomial PMF values $p_X(k)$ with their Poisson approximations
$p_Z(k)$, we see that there is close agreement.

Suppose now that Kasparov plays simultaneously just 5 opponents, who are,
however, stronger so that his probability of a win per game is 0.9. Here are the
binomial probabilities $p_X(k)$ for n = 5 and p = 0.1, and the corresponding Poisson
approximations $p_Z(k)$ for $\lambda = np = 0.5$:

$$\begin{aligned}
p_X(0) &= 0.590, & p_Z(0) &= 0.607, \\
p_X(1) &= 0.328, & p_Z(1) &= 0.303, \\
p_X(2) &= 0.0729, & p_Z(2) &= 0.0758, \\
p_X(3) &= 0.0081, & p_Z(3) &= 0.0126, \\
p_X(4) &= 0.00045, & p_Z(4) &= 0.0016, \\
p_X(5) &= 0.00001, & p_Z(5) &= 0.00016.
\end{aligned}$$

We see that the approximation, while not poor, is considerably less accurate than
in the case where n = 100 and p = 0.01.
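
The comparisons in Example 5.5 are straightforward to recompute. The sketch below evaluates the binomial PMF and its Poisson approximation side by side for both settings of the example (n = 100, p = 0.01 and n = 5, p = 0.1). The function names and the output formatting are assumptions made for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k arrivals with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def compare(n, p, ks):
    """List of (k, binomial value, Poisson value) with lam = n * p."""
    lam = n * p
    return [(k, binomial_pmf(k, n, p), poisson_pmf(k, lam)) for k in ks]

print("n = 100, p = 0.01")
for k, b, z in compare(100, 0.01, [0, 2, 5, 10]):
    print(f"  k={k:2d}  binomial={b:.4g}  poisson={z:.4g}")

print("n = 5, p = 0.1")
for k, b, z in compare(5, 0.1, range(6)):
    print(f"  k={k:2d}  binomial={b:.4g}  poisson={z:.4g}")
```

Printing four significant digits makes it easy to see where the two columns start to diverge in each setting.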


Example 5.6. A packet consisting of a string of n symbols is transmitted over 
a noisy channel. Each symbol has probability p = 0.0001 of being transmitted in 
error, independently of errors in the other symbols. How small should n be in order 
for the probability of incorrect transmission (at least one symbol in error) to be less 
than 0.001? 

Each symbol transmission is viewed as an independent Bernoulli trial. Thus,
the probability of a positive number S of errors in the packet is

$$1 - P(S = 0) = 1 - (1 - p)^n.$$

For this probability to be less than 0.001, we must have $1 - (1 - 0.0001)^n < 0.001$
or

$$n < \frac{\ln 0.999}{\ln 0.9999} = 10.0045.$$

We can also use the Poisson approximation for P(S = 0), which is $e^{-\lambda}$ with
$\lambda = np = 0.0001 \cdot n$, and obtain the condition $1 - e^{-0.0001 n} < 0.001$, which leads to

$$n < \frac{-\ln 0.999}{0.0001} = 10.005.$$

Given that n must be integer, both methods lead to the same conclusion that n
can be at most 10.
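
The two bounds of Example 5.6 can be reproduced with a few lines of Python. The sketch below computes the exact bound from the binomial expression and the approximate bound from the Poisson expression, and converts each into the largest admissible integer packet length. The function names are hypothetical; the values p = 0.0001 and 0.001 are those of the example.

```python
import math

def max_length_exact(p, target):
    """Largest integer n with 1 - (1 - p)^n < target, and the real-valued bound."""
    bound = math.log(1 - target) / math.log(1 - p)
    # Largest integer strictly below the bound.
    return math.ceil(bound) - 1, bound

def max_length_poisson(p, target):
    """Same question using the Poisson approximation P(S = 0) = exp(-n p)."""
    bound = -math.log(1 - target) / p
    return math.ceil(bound) - 1, bound

p, target = 0.0001, 0.001
print(max_length_exact(p, target))
print(max_length_poisson(p, target))
```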


5.2 THE POISSON PROCESS


The Poisson process can be viewed as a continuous-time analog of the Bernoulli 
process and applies to situations where there is no natural way of dividing time 
into discrete periods. 

To see the need for a continuous-time version of the Bernoulli process, let 
us consider a possible model of traffic accidents within a city. We can start by 
discretizing time into one-minute periods and record a “success” during every 
minute in which there is at least one traffic accident. Assuming the traffic in- 
tensity to be constant over time, the probability of an accident should be the 
same during each period. Under the additional (and quite plausible) assumption 
that different time periods are independent, the sequence of successes becomes a 
Bernoulli process. Note that in real life, two or more accidents during the same 
one-minute interval are certainly possible, but the Bernoulli process model does 
not keep track of the exact number of accidents. In particular, it does not allow 
us to calculate the expected number of accidents within a given period. 

One way around this difficulty is to choose the length of a time period to be 
very small, so that the probability of two or more accidents becomes negligible. 
But how small should it be? A second? A millisecond? Instead of answering 
this question, it is preferable to consider a limiting situation where the length of 
the time period becomes zero, and work with a continuous time model. 

We consider an arrival process that evolves in continuous time, in the sense 
that any real number t is a possible arrival time. We define


$$P(k, \tau) = P(\text{there are exactly $k$ arrivals during an interval of length $\tau$}),$$


and assume that this probability is the same for all intervals of the same length
$\tau$. We also introduce a positive parameter $\lambda$ to be referred to as the arrival
rate or intensity of the process, for reasons that will soon be apparent.


Definition of the Poisson Process 


An arrival process is called a Poisson process with rate $\lambda$ if it has the
following properties:


(a) (Time-homogeneity.) The probability $P(k, \tau)$ of k arrivals is the
same for all intervals of the same length $\tau$.


(b) (Independence.) The number of arrivals during a particular interval
is independent of the history of arrivals outside this interval.


(c) (Small interval probabilities.) The probabilities $P(k, \tau)$ satisfy

$$P(0, \tau) = 1 - \lambda\tau + o(\tau),$$
$$P(1, \tau) = \lambda\tau + o_1(\tau).$$

Here, $o(\tau)$ and $o_1(\tau)$ are functions of $\tau$ that satisfy

$$\lim_{\tau \to 0} \frac{o(\tau)}{\tau} = 0, \qquad \lim_{\tau \to 0} \frac{o_1(\tau)}{\tau} = 0.$$

The first property states that arrivals are “equally likely” at all times. The 
arrivals during any time interval of length $\tau$ are statistically the same, in the
sense that they obey the same probability law. This is a counterpart of the 
assumption that the success probability p in a Bernoulli process is constant over 
time. 

To interpret the second property, consider a particular interval $[t, t']$, of
length $t' - t$. The unconditional probability of k arrivals during that interval
is $P(k, t' - t)$. Suppose now that we are given complete or partial information
on the arrivals outside this interval. Property (b) states that this information
is irrelevant: the conditional probability of k arrivals during $[t, t']$ remains equal
to the unconditional probability $P(k, t' - t)$. This property is analogous to the
independence of trials in a Bernoulli process.

The third property is critical. The $o(\tau)$ and $o_1(\tau)$ terms are meant to be
negligible in comparison to $\tau$, when the interval length $\tau$ is very small. They can
be thought of as the $O(\tau^2)$ terms in a Taylor series expansion of $P(k, \tau)$. Thus,
for small $\tau$, the probability of a single arrival is roughly $\lambda\tau$, plus a negligible
term. Similarly, for small $\tau$, the probability of zero arrivals is roughly $1 - \lambda\tau$.
Note that the probability of two or more arrivals is


$$1 - P(0, \tau) - P(1, \tau) = -o(\tau) - o_1(\tau),$$


and is negligible in comparison to $P(1, \tau)$ as $\tau$ gets smaller and smaller.


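[Figure 5.5 shows an interval of length $\tau$ divided into periods of length $\delta$, with
arrivals marked along the time axis. Number of periods: $n = \tau/\delta$; probability of
success per period: $p = \lambda\delta$; expected number of arrivals: $np = \lambda\tau$.]

Figure 5.5: Bernoulli approximation of the Poisson process.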


Let us now start with a fixed time interval of length $\tau$ and partition it
into $\tau/\delta$ periods of length $\delta$, where $\delta$ is a very small number; see Fig. 5.5. The
probability of two or more arrivals during any period can be neglected, because
of property (c) and the preceding discussion. Different periods are independent,
by property (b). Furthermore, each period has one arrival with probability
approximately equal to $\lambda\delta$, or zero arrivals with probability approximately equal
to $1 - \lambda\delta$. Therefore, the process being studied can be approximated by a
Bernoulli process, with the approximation becoming more and more accurate
the smaller $\delta$ is chosen. Thus the probability $P(k, \tau)$ of k arrivals in time $\tau$ is
approximately the same as the (binomial) probability of k successes in $n = \tau/\delta$
independent Bernoulli trials with success probability $p = \lambda\delta$ at each trial. While
keeping the length $\tau$ of the interval fixed, we let the period length $\delta$ decrease
to zero. We then note that the number n of periods goes to infinity, while the
product np remains constant and equal to $\lambda\tau$. Under these circumstances, we
saw in the previous section that the binomial PMF converges to a Poisson PMF
with parameter $\lambda\tau$. We are then led to the important conclusion that


$$P(k, \tau) = e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!}, \qquad k = 0, 1, \ldots$$


Note that a Taylor series expansion of $e^{-\lambda\tau}$ yields


$$P(0, \tau) = e^{-\lambda\tau} = 1 - \lambda\tau + O(\tau^2),$$
$$P(1, \tau) = \lambda\tau e^{-\lambda\tau} = \lambda\tau - \lambda^2\tau^2 + O(\tau^3) = \lambda\tau + O(\tau^2),$$


consistent with property (c).

Using our earlier formulas for the mean and variance of the Poisson PMF,
we obtain


$$E[N_\tau] = \lambda\tau, \qquad \mathrm{var}(N_\tau) = \lambda\tau,$$


where $N_\tau$ stands for the number of arrivals during a time interval of length $\tau$.
These formulas are hardly surprising, since we are dealing with the limit of a
binomial PMF with parameters $n = \tau/\delta$, $p = \lambda\delta$, mean $np = \lambda\tau$, and variance
$np(1 - p) \approx np = \lambda\tau$.
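
The limiting argument can be observed numerically. The sketch below fixes an interval of length τ, divides it into n periods of length δ = τ/n, and compares the binomial probability of k successes (with p = λδ) against the Poisson value as n grows. The function names and the values λ = 2, τ = 1, k = 3 are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k, lam, tau):
    """P(k, tau) for a Poisson process with rate lam."""
    mean = lam * tau
    return math.exp(-mean) * mean**k / math.factorial(k)

def bernoulli_approximation(k, lam, tau, n):
    """Binomial probability of k successes in n periods of length tau / n,
    each with success probability p = lam * (tau / n)."""
    p = lam * tau / n
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Illustrative values (assumed for this sketch).
lam, tau, k = 2.0, 1.0, 3
exact = poisson_pmf(k, lam, tau)
for n in (10, 100, 1000, 100_000):
    approx = bernoulli_approximation(k, lam, tau, n)
    print(n, approx, abs(approx - exact))
```

The printed error shrinks as the period length decreases, in line with the claim that the approximation becomes more accurate the smaller δ is chosen.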

Let us now derive the probability law for the time T of the first arrival, 
assuming that the process starts at time zero. Note that we have T > t if and
only if there are no arrivals during the interval [0, t]. Therefore,


$$F_T(t) = P(T \leq t) = 1 - P(T > t) = 1 - P(0, t) = 1 - e^{-\lambda t}, \qquad t \geq 0.$$

We then differentiate the CDF $F_T(t)$ of T, and obtain the PDF formula

$$f_T(t) = \lambda e^{-\lambda t}, \qquad t \geq 0,$$

which shows that the time until the first arrival is exponentially distributed with
parameter $\lambda$. We summarize this discussion in the table that follows. See also
Fig. 5.6.

Random Variables Associated with the Poisson Process and their
Properties


• The Poisson with parameter $\lambda\tau$. This is the number $N_\tau$ of arrivals
in a Poisson process with rate $\lambda$, over an interval of length $\tau$. Its PMF,
mean, and variance are

$$p_{N_\tau}(k) = P(k, \tau) = e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!}, \qquad k = 0, 1, \ldots,$$

$$E[N_\tau] = \lambda\tau, \qquad \mathrm{var}(N_\tau) = \lambda\tau.$$


• The exponential with parameter $\lambda$. This is the time T until the
first arrival. Its PDF, mean, and variance are

$$f_T(t) = \lambda e^{-\lambda t}, \quad t \geq 0, \qquad E[T] = \frac{1}{\lambda}, \qquad \mathrm{var}(T) = \frac{1}{\lambda^2}.$$


[Figure 5.6 shows arrivals along a time axis divided into small intervals, each
treated as a Bernoulli trial with parameter $p = \lambda\delta$, together with the following table.]

                          POISSON               BERNOULLI
Times of Arrival          Continuous            Discrete
PMF of # of Arrivals      Poisson               Binomial
Interarrival Time CDF     Exponential           Geometric
Arrival Rate              $\lambda$/unit time   p/per trial


Figure 5.6: View of the Bernoulli process as the discrete-time version of the
Poisson. We discretize time in small intervals $\delta$ and associate each interval with
a Bernoulli trial whose parameter is $p = \lambda\delta$. The table summarizes some of the
basic correspondences.

Example 5.7. You get email according to a Poisson process at a rate of $\lambda = 0.2$
messages per hour. You check your email every hour. What is the probability of
finding 0 and 1 new messages?

These probabilities can be found using the Poisson PMF $(\lambda\tau)^k e^{-\lambda\tau}/k!$, with
$\tau = 1$, and k = 0 or k = 1:


$$P(0, 1) = e^{-0.2} = 0.819, \qquad P(1, 1) = 0.2 \cdot e^{-0.2} = 0.164.$$


Suppose that you have not checked your email for a whole day. What is the
probability of finding no new messages? We use again the Poisson PMF and obtain


$$P(0, 24) = e^{-0.2 \cdot 24} = 0.00823.$$


Alternatively, we can argue that the event of no messages in a 24-hour period is
the intersection of the events of no messages during each of 24 hours. These latter
events are independent and the probability of each is $P(0, 1) = e^{-0.2}$, so


$$P(0, 24) = \big(P(0, 1)\big)^{24} = \big(e^{-0.2}\big)^{24} = 0.00823,$$


which is consistent with the preceding calculation method.
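
The numbers in Example 5.7 can be recomputed directly from the Poisson PMF. The sketch below evaluates P(0, 1), P(1, 1), and P(0, 24) for a rate of 0.2 messages per hour, and also forms the product of 24 hourly no-message probabilities. The function name is an assumption made for illustration.

```python
import math

def poisson_prob(k, lam, tau):
    """P(k, tau): probability of k arrivals in an interval of length tau."""
    mean = lam * tau
    return math.exp(-mean) * mean**k / math.factorial(k)

lam = 0.2  # messages per hour, as in the example

p0_hour = poisson_prob(0, lam, 1)
p1_hour = poisson_prob(1, lam, 1)
p0_day = poisson_prob(0, lam, 24)

print(p0_hour, p1_hour)
# Two routes to the same probability: one 24-hour interval, or 24 independent hours.
print(p0_day, p0_hour**24)
```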


Example 5.8. Sum of Independent Poisson Random Variables. Arrivals 
of customers at the local supermarket are modeled by a Poisson process with a 
rate of $\lambda = 10$ customers per minute. Let $M$ be the number of customers arriving 
between 9:00 and 9:10. Also, let $N$ be the number of customers arriving between 
9:30 and 9:35. What is the distribution of $M + N$? 

We notice that $M$ is Poisson with parameter $\mu = 10 \cdot 10 = 100$ and $N$ is Poisson 
with parameter $\nu = 10 \cdot 5 = 50$. Furthermore, $M$ and $N$ are independent. As shown 
in Section 4.1, using transforms, $M + N$ is Poisson with parameter $\mu + \nu = 150$. We 
will now proceed to derive the same result in a more direct and intuitive manner. 

Let $\tilde{N}$ be the number of customers that arrive between 9:10 and 9:15. Note 
that $\tilde{N}$ has the same distribution as $N$ (Poisson with parameter 50). Furthermore, 
$\tilde{N}$ is also independent of $M$. Thus, the distribution of $M + N$ is the same as the 
distribution of $M + \tilde{N}$. But $M + \tilde{N}$ is the number of arrivals during an interval of 
length 15, and has therefore a Poisson distribution with parameter $10 \cdot 15 = 150$. 

This example makes a point that is valid in general. The probability of $k$ 
arrivals during a set of times of total length $\tau$ is always given by $P(k, \tau)$, even if 
that set is not an interval. (In this example, we dealt with the set 
$[9{:}00, 9{:}10] \cup [9{:}30, 9{:}35]$, of total length 15.) 
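
As a numerical cross-check of the claim that the sum is Poisson with parameter 150, one can convolve the two PMFs directly. This is only a sketch: the function names are ours, and the PMF is evaluated in log space (via `math.lgamma`) purely to avoid overflow for large means.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    # exp(k*log(mean) - mean - log(k!)), computed in log space for stability.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def pmf_of_sum(k: int, mean_m: float, mean_n: float) -> float:
    # Convolution: P(M + N = k) = sum_j P(M = j) P(N = k - j), by independence.
    return sum(poisson_pmf(j, mean_m) * poisson_pmf(k - j, mean_n)
               for j in range(k + 1))

mean_m, mean_n = 100.0, 50.0  # parameters of M and N in Example 5.8

# Largest discrepancy between the convolution and the Poisson(150) PMF
# over a range of k around the mean.
max_gap = max(abs(pmf_of_sum(k, mean_m, mean_n) - poisson_pmf(k, mean_m + mean_n))
              for k in range(100, 201))
print(max_gap)
```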


Example 5.9. During rush hour, from 8 am to 9 am, traffic accidents occur 
according to a Poisson process with a rate of 5 accidents per hour. Between 9 
am and 11 am, they occur as an independent Poisson process with a rate $\nu$ of 3 
accidents per hour. What is the PMF of the total number of accidents between 8 
am and 11 am? 

This is the sum of two independent Poisson random variables with parameters 
5 and $3 \cdot 2 = 6$, respectively. Since the sum of independent Poisson random variables 
is also Poisson, the total number of accidents has a Poisson PMF with parameter 
$5 + 6 = 11$. 


Independence and Memorylessness 


The Poisson process has several properties that parallel those of the Bernoulli 
process, including the independence of nonoverlapping time sets, a fresh-start 
property, and the memorylessness of the interarrival time distribution. Given 
that the Poisson process can be viewed as a limiting case of a Bernoulli process, 
the fact that it inherits the qualitative properties of the latter should be hardly 
surprising. 


(a) Independence of nonoverlapping sets of times. Consider two disjoint 
sets of times $A$ and $B$, such as $A = [0,1] \cup [4,\infty)$ and $B = [1.5, 3.6]$, for 
example. If $U$ and $V$ are random variables that are completely determined 
by what happens during $A$ (respectively, $B$), then $U$ and $V$ are independent. 
This is a consequence of the second defining property of the Poisson 
process. 


(b) Fresh-start property. As a special case of the preceding observation, we 
notice that the history of the process until a particular time $t$ is independent 
from the future of the process. Furthermore, if we focus on that portion 
of the Poisson process that starts at time $t$, we observe that it inherits the 
defining properties of the original process. For this reason, the portion of 
the Poisson process that starts at any particular time $t > 0$ is a probabilistic 
replica of the Poisson process starting at time 0, and is independent of the 
portion of the process prior to time $t$. Thus, we can say that the Poisson 
process starts afresh at each time instant. 


(c) Memoryless interarrival time distribution. We have already seen that 
the geometric PMF (interarrival time in the Bernoulli process) is memoryless: 
the number of remaining trials until the first future arrival does 
not depend on the past. The exponential PDF (interarrival time in the 
Poisson process) has a similar property: given the current time $t$ and the 
past history, the future is a fresh-starting Poisson process, hence the remaining 
time until the next arrival has the same exponential distribution. 
In particular, if $T$ is the time of the first arrival and if we are told that 
$T > t$, then the remaining time $T - t$ is exponentially distributed, with the 
same parameter $\lambda$. For an algebraic derivation of this latter fact, we first 
use the exponential CDF to obtain $P(T > t) = e^{-\lambda t}$. We then note that 
for all positive scalars $s$ and $t$, we have 


$$\begin{aligned}
P(T > t + s \mid T > t) &= \frac{P(T > t + s,\ T > t)}{P(T > t)} \\
&= \frac{P(T > t + s)}{P(T > t)} \\
&= \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} \\
&= e^{-\lambda s}.
\end{aligned}$$
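
The memoryless property can also be observed empirically. The sketch below is a Monte Carlo illustration in Python; the values of `lam`, `t`, and `s`, the seed, and the sample size are arbitrary choices of ours.

```python
import math
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
lam, t, s = 0.5, 2.0, 1.5       # illustrative rate and time offsets

samples = [rng.expovariate(lam) for _ in range(200_000)]

# Condition on T > t, then ask how often T also exceeds t + s.
survivors = [x for x in samples if x > t]
conditional = sum(1 for x in survivors if x > t + s) / len(survivors)

# Memorylessness predicts P(T > t + s | T > t) = P(T > s) = exp(-lam * s).
predicted = math.exp(-lam * s)
print(conditional, predicted)
```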


Here are some examples of reasoning based on the memoryless property. 


Example 5.10. You and your partner go to a tennis court, and have to wait until 
the players occupying the court finish playing. Assume (somewhat unrealistically) 
that their playing time has an exponential PDF. Then the PDF of your waiting 
time (equivalently, their remaining playing time) also has the same exponential 
PDF, regardless of when they started playing. 


Example 5.11. When you enter the bank, you find that all three tellers are busy 
serving other customers, and there are no other customers in queue. Assume that 
the service times for you and for each of the customers being served are independent 
identically distributed exponential random variables. What is the probability that 
you will be the last to leave? 

The answer is 1/3. To see this, focus on the moment when you start service 
with one of the tellers. Then, the remaining time of each of the other two customers 
being served, as well as your own remaining time, have the same PDF. Therefore, 
you and the other two customers have equal probability 1/3 of being the last to 
leave. 
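
A short simulation can make the answer 1/3 in Example 5.11 plausible. This is a sketch under the example's assumptions (independent, identically distributed exponential service times); the function name, rate, seed, and number of trials are our own illustrative choices.

```python
import random

def you_leave_last(rate: float, rng: random.Random) -> bool:
    # Remaining service times of the three customers currently at the tellers.
    # By memorylessness these are again exponential with the same rate.
    others = [rng.expovariate(rate) for _ in range(3)]
    start = min(others)                     # you begin when the first teller frees up
    finish = start + rng.expovariate(rate)  # your own departure time
    return finish > max(others)             # True if everyone else has already left

rng = random.Random(2)
trials = 100_000
estimate = sum(you_leave_last(1.0, rng) for _ in range(trials)) / trials
print(estimate)  # should be close to 1/3
```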


Interarrival Times 


An important random variable associated with a Poisson process that starts at 
time 0 is the time of the $k$th arrival, which we denote by $Y_k$. A related random 
variable is the $k$th interarrival time, denoted by $T_k$. It is defined by 


$$T_1 = Y_1, \qquad T_k = Y_k - Y_{k-1}, \quad k = 2, 3, \ldots$$


and represents the amount of time between the $(k-1)$st and the $k$th arrival. Note 
that 
$$Y_k = T_1 + T_2 + \cdots + T_k.$$


We have already seen that the time $T_1$ until the first arrival is an exponential 
random variable with parameter $\lambda$. Starting from the time $T_1$ of the first 
arrival, the future is a fresh-starting Poisson process. Thus, the time until the 
next arrival has the same exponential PDF. Furthermore, the past of the process 
(up to time $T_1$) is independent of the future (after time $T_1$). Since $T_2$ is determined 
exclusively by what happens in the future, we see that $T_2$ is independent 
of $T_1$. Continuing similarly, we conclude that the random variables $T_1, T_2, T_3, \ldots$ 
are independent and all have the same exponential distribution. 

This important observation leads to an alternative, but equivalent, way of 
describing the Poisson process.† 


Alternative Description of the Poisson Process 


1. Start with a sequence of independent exponential random variables 
$T_1, T_2, \ldots$, with common parameter $\lambda$, and let these stand for the 
interarrival times. 


2. Record an arrival at times $T_1$, $T_1 + T_2$, $T_1 + T_2 + T_3$, etc. 
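
The two steps above translate directly into a simulation recipe. The following Python sketch generates arrival times by accumulating exponential interarrival times and then checks that the number of arrivals in $[0, \tau]$ has mean and variance close to $\lambda\tau$, as a Poisson count should. The function name `arrival_times` and all numerical values are assumptions made for illustration.

```python
import random

def arrival_times(lam: float, horizon: float, rng: random.Random) -> list:
    """Arrival times in [0, horizon], built from exponential interarrival times."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)   # step 1: draw the next interarrival time
        if t > horizon:
            return times
        times.append(t)             # step 2: record an arrival at the running sum

rng = random.Random(1)
lam, tau = 3.0, 2.0
counts = [len(arrival_times(lam, tau, rng)) for _ in range(50_000)]

mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
print(mean_count, var_count)  # both should be near lam * tau = 6
```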


The kth Arrival Time 


The time $Y_k$ of the $k$th arrival is equal to the sum $Y_k = T_1 + T_2 + \cdots + T_k$ of 
$k$ independent identically distributed exponential random variables. This allows 
us to derive formulas for the mean, variance, and PDF of $Y_k$, which are given in 
the table that follows. 


Properties of the kth Arrival Time 


• The $k$th arrival time is equal to the sum of the first $k$ interarrival times 
$$Y_k = T_1 + T_2 + \cdots + T_k,$$


and the latter are independent exponential random variables with common 
parameter $\lambda$. 


† In our original definition, a process was called Poisson if it possessed certain 
properties. However, the astute reader may have noticed that we have not so far 
established that there exists a process with the required properties. In an alternative 
line of development, we could have defined the Poisson process by the alternative 
description given here, and such a process is clearly well-defined: we start with a 
sequence of independent interarrival times, from which the arrival times are completely 
determined. Starting with this definition, it is then possible to establish that the 
process satisfies all of the properties that were postulated in our original definition. 
• The mean and variance of $Y_k$ are given by 


$$E[Y_k] = E[T_1] + \cdots + E[T_k] = \frac{k}{\lambda},$$


$$\mathrm{var}(Y_k) = \mathrm{var}(T_1) + \cdots + \mathrm{var}(T_k) = \frac{k}{\lambda^2}.$$


• The PDF of $Y_k$ is given by 


$$f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!},$$


and is known as the Erlang PDF of order $k$. 


To evaluate the PDF $f_{Y_k}$ of $Y_k$, we can argue that for a small $\delta$, the product 
$\delta \cdot f_{Y_k}(y)$ is the probability that the $k$th arrival occurs between times $y$ and $y + \delta$.† 
When $\delta$ is very small, the probability of more than one arrival during the interval 
$[y, y + \delta]$ is negligible. Thus, the $k$th arrival occurs between $y$ and $y + \delta$ if and 
only if the following two events $A$ and $B$ occur: 


(a) event $A$: there is an arrival during the interval $[y, y + \delta]$; 
(b) event $B$: there are exactly $k - 1$ arrivals before time $y$. 
The probabilities of these two events are 


$$P(A) \approx \lambda\delta, \qquad \text{and} \qquad P(B) = P(k-1, y) = \frac{\lambda^{k-1} y^{k-1} e^{-\lambda y}}{(k-1)!}.$$


† For an alternative derivation that does not rely on approximation arguments, 
note that for a given $y \geq 0$, the event $\{Y_k \leq y\}$ is the same as the event 


$$\{\text{number of arrivals in the interval } [0, y] \text{ is at least } k\}.$$


Thus the CDF of $Y_k$ is given by 


$$F_{Y_k}(y) = P(Y_k \leq y) = \sum_{n=k}^{\infty} P(n, y) = 1 - \sum_{n=0}^{k-1} P(n, y) = 1 - \sum_{n=0}^{k-1} \frac{(\lambda y)^n e^{-\lambda y}}{n!}.$$


The PDF of $Y_k$ can be obtained by differentiating the above expression, which by 
straightforward calculation yields the Erlang PDF formula 


$$f_{Y_k}(y) = \frac{d}{dy} F_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}.$$
Since $A$ and $B$ are independent, we have 


$$\delta f_{Y_k}(y) \approx P(y \leq Y_k \leq y + \delta) \approx P(A \cap B) = P(A) P(B) \approx \lambda\delta \, \frac{\lambda^{k-1} y^{k-1} e^{-\lambda y}}{(k-1)!},$$


from which we obtain 


$$f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}, \qquad y \geq 0.$$
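
The Erlang PDF and the CDF expressed through Poisson probabilities can be compared numerically: differentiating the CDF by finite differences should reproduce the PDF. The sketch below is in Python; the function names and the test point ($k = 3$, $\lambda = 2$, $y = 1.7$) are our own choices.

```python
import math

def erlang_pdf(y: float, k: int, lam: float) -> float:
    """Erlang PDF of order k: lam^k y^(k-1) e^(-lam y) / (k-1)!."""
    return lam ** k * y ** (k - 1) * math.exp(-lam * y) / math.factorial(k - 1)

def erlang_cdf(y: float, k: int, lam: float) -> float:
    """P(Y_k <= y) = 1 - P(fewer than k arrivals in [0, y])."""
    return 1.0 - sum((lam * y) ** n * math.exp(-lam * y) / math.factorial(n)
                     for n in range(k))

k, lam, y = 3, 2.0, 1.7
h = 1e-5
# Central finite difference approximating dF/dy at y.
numeric_derivative = (erlang_cdf(y + h, k, lam) - erlang_cdf(y - h, k, lam)) / (2 * h)

print(numeric_derivative, erlang_pdf(y, k, lam))
```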


Example 5.12. You call the IRS hotline and you are told that you are the 
56th person in line, excluding the person currently being served. Callers depart 
according to a Poisson process with a rate of $\lambda = 2$ per minute. How long will you 
have to wait on the average until your service starts, and what is the probability 
you will have to wait for more than an hour? 

By the memoryless property, the remaining service time of the person currently 
being served is exponentially distributed with parameter 2. The service times 
of the 55 persons ahead of you are also exponential with the same parameter, and 
all of these random variables are independent. Thus, your waiting time $Y$ is Erlang 
of order 56, and 


$$E[Y] = \frac{56}{\lambda} = 28.$$


The probability that you have to wait for more than an hour is given by the formula 


$$P(Y > 60) = \int_{60}^{\infty} \frac{\lambda^{56} y^{55} e^{-\lambda y}}{55!} \, dy.$$


Computing this probability is quite tedious. In Chapter 7, we will discuss a much 
easier way to approximate this probability. This is done using the central 
limit theorem, which allows us to approximate the CDF of the sum of a large number 
of random variables with a normal CDF and then to calculate various probabilities 
of interest by using the normal tables. 
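
Although the integral is tedious by hand, it is easy to evaluate on a computer. One route, sketched below in Python, uses the fact that the $k$th arrival occurs after time $t$ exactly when there are at most $k - 1$ arrivals in $[0, t]$, so the tail probability is a finite sum of Poisson probabilities. The function name is our own and this is offered only as an illustration, not as the method the text has in mind.

```python
import math

def prob_wait_exceeds(t: float, k: int, lam: float) -> float:
    """P(Y_k > t) = P(at most k - 1 arrivals in [0, t]) for a rate-lam Poisson process."""
    mean = lam * t
    term = math.exp(-mean)      # Poisson probability of n = 0 arrivals
    total = term
    for n in range(1, k):
        term *= mean / n        # recurrence: P(n) = P(n - 1) * mean / n
        total += term
    return total

lam, k = 2.0, 56                # rate per minute and Erlang order in Example 5.12
expected_wait = k / lam         # E[Y] in minutes
p_more_than_hour = prob_wait_exceeds(60.0, k, lam)

print(expected_wait, p_more_than_hour)
```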


Splitting and Merging of Poisson Processes 


Similar to the case of a Bernoulli process, we can start with a Poisson process 
with rate $\lambda$ and split it, as follows: each arrival is kept with probability $p$ and 
discarded with probability $1 - p$, independently of what happens to other arrivals. 
In the Bernoulli case, we saw that the result of the splitting was also a Bernoulli 
process. In the present context, the result of the splitting turns out to be a 
Poisson process with rate $\lambda p$. 

Alternatively, we can start with two independent Poisson processes, with 
rates $\lambda_1$ and $\lambda_2$, and merge them by recording an arrival whenever an arrival 
occurs in either process. It turns out that the merged process is also Poisson 
with rate $\lambda_1 + \lambda_2$. Furthermore, any particular arrival of the merged process 
has probability $\lambda_1/(\lambda_1 + \lambda_2)$ of originating from the first process and probability 
$\lambda_2/(\lambda_1 + \lambda_2)$ of originating from the second, independently of all other arrivals 
and their origins. 

We discuss these properties in the context of some examples, and at the 
same time provide a few different arguments to establish their validity. 


Example 5.13. Splitting of Poisson Processes. A packet that arrives at a 
node of a data network is either a local packet which is destined for that node (this 
happens with probability $p$), or else it is a transit packet that must be relayed to 
another node (this happens with probability $1 - p$). Packets arrive according to a 
Poisson process with rate $\lambda$, and each one is a local or transit packet independently 
of other packets and of the arrival times. As stated above, the process of local 
packet arrivals is Poisson with rate $\lambda p$. Let us see why. 

We verify that the process of local packet arrivals satisfies the defining properties 
of a Poisson process. Since $\lambda$ and $p$ are constant (do not change with time), 
the first property (time homogeneity) clearly holds. Furthermore, there is no dependence 
between what happens in disjoint time intervals, verifying the second 
property. Finally, if we focus on an interval of small length $\delta$, the probability of 
a local arrival is approximately the probability that there is a packet arrival, and 
that this turns out to be a local one, i.e., $\lambda\delta \cdot p$. In addition, the probability of 
two or more local arrivals is negligible in comparison to $\delta$, and this verifies the 
third property. We conclude that local packet arrivals form a Poisson process and, 
in particular, the number $L_\tau$ of such arrivals during an interval of length $\tau$ has a 
Poisson PMF with parameter $p\lambda\tau$. 

Let us now rederive the Poisson PMF of $L_\tau$ using transforms. The total 
number of packets $N_\tau$ during an interval of length $\tau$ is Poisson with parameter $\lambda\tau$. 
For $i = 1, \ldots, N_\tau$, let $X_i$ be a Bernoulli random variable which is 1 if the $i$th packet 
is local, and 0 if not. Then, the random variables $X_1, X_2, \ldots$ form a Bernoulli 
process with success probability $p$. The number of local packets is the number of 
“successes,” i.e., 


$$L_\tau = X_1 + \cdots + X_{N_\tau}.$$


We are dealing here with the sum of a random number of independent random 
variables. As discussed in Section 4.4, the transform associated with $L_\tau$ is found 
by starting with the transform associated with $N_\tau$, which is 


$$M_{N_\tau}(s) = e^{\lambda\tau(e^s - 1)},$$


and replacing each occurrence of $e^s$ by the transform associated with $X_i$, which is 
$$M_X(s) = 1 - p + p e^s.$$


We obtain 
$$M_{L_\tau}(s) = e^{\lambda\tau(1 - p + p e^s - 1)} = e^{\lambda\tau p(e^s - 1)}.$$


We observe that this is the transform of a Poisson random variable with parameter 
$\lambda\tau p$, thus verifying our earlier statement for the PMF of $L_\tau$. 
We conclude with yet another method for establishing that the local packet 
process is Poisson. Let $T_1, T_2, \ldots$ be the interarrival times of packets of any type; 
these are independent exponential random variables with parameter $\lambda$. Let $K$ be 
the total number of arrivals up to and including the first local packet arrival. In 
particular, the time $S$ of the first local packet arrival is given by 


$$S = T_1 + T_2 + \cdots + T_K.$$


Since each packet is a local one with probability $p$, independently of the others, and 
by viewing each packet as a trial which is successful with probability $p$, we recognize 
$K$ as a geometric random variable with parameter $p$. Since the nature of the packets 
is independent of the arrival times, $K$ is independent from the interarrival times. We 
are therefore dealing with a sum of a random (geometrically distributed) number of 
exponential random variables. We have seen in Chapter 4 (cf. Example 4.21) that 
such a sum is exponentially distributed with parameter $\lambda p$. Since the interarrival 
times between successive local packets are clearly independent, it follows that the 
local packet arrival process is Poisson with rate $\lambda p$. 
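
The splitting property can be illustrated by simulation. The Python sketch below generates packet arrivals from exponential interarrival times, keeps each one as "local" with probability $p$, and checks that the local count over an interval of length $\tau$ has mean and variance near $\lambda p \tau$. The function name and all parameter values are hypothetical choices of ours.

```python
import random

def count_local(lam: float, p: float, tau: float, rng: random.Random) -> int:
    """Number of local packets in [0, tau] when each arrival is local with probability p."""
    t, kept = rng.expovariate(lam), 0
    while t <= tau:
        if rng.random() < p:        # independent coin flip for each packet
            kept += 1
        t += rng.expovariate(lam)   # move to the next packet arrival
    return kept

rng = random.Random(3)
lam, p, tau = 4.0, 0.25, 3.0
counts = [count_local(lam, p, tau, rng) for _ in range(40_000)]

mean_local = sum(counts) / len(counts)
var_local = sum((c - mean_local) ** 2 for c in counts) / (len(counts) - 1)
print(mean_local, var_local)  # a Poisson count would have both near lam * p * tau = 3
```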


Example 5.14. Merging of Poisson Processes. People with letters to mail 
arrive at the post office according to a Poisson process with rate $\lambda_1$, while people 
with packages to mail arrive according to an independent Poisson process with rate 
$\lambda_2$. As stated earlier, the merged process, which includes arrivals of both types, is 
Poisson with rate $\lambda_1 + \lambda_2$. Let us see why. 

First, it should be clear that the merged process satisfies the time-homogeneity 
property. Furthermore, since different intervals in each of the two arrival processes 
are independent, the same property holds for the merged process. Let us now focus 
on a small interval of length $\delta$. Ignoring terms that are negligible compared to $\delta$, 
we have 


$$\begin{aligned}
P(0 \text{ arrivals in the merged process}) &\approx (1 - \lambda_1\delta)(1 - \lambda_2\delta) \approx 1 - (\lambda_1 + \lambda_2)\delta, \\
P(1 \text{ arrival in the merged process}) &\approx \lambda_1\delta(1 - \lambda_2\delta) + (1 - \lambda_1\delta)\lambda_2\delta \approx (\lambda_1 + \lambda_2)\delta,
\end{aligned}$$


and the third property has been verified. 

Given that an arrival has just been recorded, what is the probability that it 
is an arrival of a person with a letter to mail? We focus again on a small interval 
of length $\delta$ around the current time, and we seek the probability 


$$P(1 \text{ arrival of person with a letter} \mid 1 \text{ arrival}).$$


Using the definition of conditional probabilities, and ignoring the negligible probability 
of more than one arrival, this is 


$$\frac{P(1 \text{ arrival of person with a letter})}{P(1 \text{ arrival})} \approx \frac{\lambda_1\delta}{(\lambda_1 + \lambda_2)\delta} = \frac{\lambda_1}{\lambda_1 + \lambda_2}.$$
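
Both conclusions of this example (merged rate $\lambda_1 + \lambda_2$, and origin probability $\lambda_1/(\lambda_1 + \lambda_2)$) can be checked by simulation. In the Python sketch below, the function name, rates, horizon, and seed are illustrative assumptions.

```python
import random

def poisson_arrivals(lam: float, horizon: float, rng: random.Random) -> list:
    """Arrival times in [0, horizon] of a rate-lam Poisson process."""
    times, t = [], rng.expovariate(lam)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(lam)
    return times

rng = random.Random(4)
lam1, lam2, horizon = 1.0, 3.0, 20_000.0

# Tag each arrival with its source (1 = letters, 2 = packages) and merge by time.
merged = sorted([(t, 1) for t in poisson_arrivals(lam1, horizon, rng)] +
                [(t, 2) for t in poisson_arrivals(lam2, horizon, rng)])

frac_first = sum(1 for _, src in merged if src == 1) / len(merged)
gaps = [b[0] - a[0] for a, b in zip(merged, merged[1:])]
mean_gap = sum(gaps) / len(gaps)

print(frac_first)  # compare with lam1 / (lam1 + lam2) = 0.25
print(mean_gap)    # compare with 1 / (lam1 + lam2) = 0.25
```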


Example 5.15. Competing Exponentials. Two light bulbs have independent 
and exponentially distributed lifetimes $T^{(1)}$ and $T^{(2)}$, with parameters $\lambda_1$ and $\lambda_2$, 
respectively. What is the distribution of the first time $Z = \min\{T^{(1)}, T^{(2)}\}$ at which 
a bulb burns out? 

We can treat this as an exercise in derived distributions. For all $z \geq 0$, we 
have 


$$\begin{aligned}
F_Z(z) &= P(\min\{T^{(1)}, T^{(2)}\} \leq z) \\
&= 1 - P(\min\{T^{(1)}, T^{(2)}\} > z) \\
&= 1 - P(T^{(1)} > z,\ T^{(2)} > z) \\
&= 1 - P(T^{(1)} > z) P(T^{(2)} > z) \\
&= 1 - e^{-\lambda_1 z} e^{-\lambda_2 z} \\
&= 1 - e^{-(\lambda_1 + \lambda_2) z}.
\end{aligned}$$


This is recognized as the exponential CDF with parameter $\lambda_1 + \lambda_2$. Thus, the minimum 
of two independent exponentials with parameters $\lambda_1$ and $\lambda_2$ is an exponential 
with parameter $\lambda_1 + \lambda_2$. 

For a more intuitive explanation of this fact, let us think of $T^{(1)}$ (respectively, 
$T^{(2)}$) as the time of the first arrival in one of two independent Poisson processes with rate 
$\lambda_1$ (respectively, $\lambda_2$). If we merge these two Poisson processes, the first arrival 
time will be $\min\{T^{(1)}, T^{(2)}\}$. But we already know that the merged process is 
Poisson with rate $\lambda_1 + \lambda_2$, and it follows that the first arrival time, $\min\{T^{(1)}, T^{(2)}\}$, 
is exponential with parameter $\lambda_1 + \lambda_2$. 


The preceding discussion can be generalized to the case of more than two 
processes. Thus, the total arrival process obtained by merging the arrivals of 
$n$ independent Poisson processes with arrival rates $\lambda_1, \ldots, \lambda_n$ is Poisson with 
arrival rate equal to the sum $\lambda_1 + \cdots + \lambda_n$. 


Example 5.16. More on Competing Exponentials. Three light bulbs have 
independent exponentially distributed lifetimes with a common parameter $\lambda$. What 
is the expectation of the time until the last bulb burns out? 

We think of the times when each bulb burns out as the first arrival times 
in independent Poisson processes. In the beginning, we have three bulbs, and the 
merged process has rate $3\lambda$. Thus, the time $T_1$ of the first burnout is exponential 
with parameter $3\lambda$, and mean $1/(3\lambda)$. Once a bulb burns out, and because of the 
memorylessness property of the exponential distribution, the remaining lifetimes 
of the other two lightbulbs are again independent exponential random variables 
with parameter $\lambda$. We thus have two Poisson processes running in parallel, and 
the remaining time $T_2$ until the first arrival in one of these two processes is now 
exponential with parameter $2\lambda$ and mean $1/(2\lambda)$. Finally, once a second bulb burns 
out, we are left with a single one. Using memorylessness once more, the remaining 
time $T_3$ until the last bulb burns out is exponential with parameter $\lambda$ and mean 
$1/\lambda$. Thus, the expectation of the total time is 


$$E[T_1 + T_2 + T_3] = \frac{1}{3\lambda} + \frac{1}{2\lambda} + \frac{1}{\lambda}.$$


Note that the random variables $T_1, T_2, T_3$ are independent, because of memorylessness. 
This also allows us to compute the variance of the total time: 


$$\mathrm{var}(T_1 + T_2 + T_3) = \mathrm{var}(T_1) + \mathrm{var}(T_2) + \mathrm{var}(T_3) = \frac{1}{9\lambda^2} + \frac{1}{4\lambda^2} + \frac{1}{\lambda^2}.$$
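
The mean and variance formulas of this example can be compared against a direct simulation of the time at which the last of three bulbs burns out, which is simply the maximum of three independent exponential lifetimes. The Python sketch below uses $\lambda = 1$, a fixed seed, and a sample size chosen by us for illustration.

```python
import random

rng = random.Random(5)
lam, trials = 1.0, 200_000

# Time until the last bulb burns out = maximum of three exponential lifetimes.
last = [max(rng.expovariate(lam) for _ in range(3)) for _ in range(trials)]

mean_last = sum(last) / trials
var_last = sum((x - mean_last) ** 2 for x in last) / (trials - 1)

# Values predicted by the stage-by-stage (memorylessness) argument.
predicted_mean = 1 / (3 * lam) + 1 / (2 * lam) + 1 / lam
predicted_var = 1 / (9 * lam ** 2) + 1 / (4 * lam ** 2) + 1 / lam ** 2

print(mean_last, predicted_mean)
print(var_last, predicted_var)
```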


We close by noting a related and quite deep fact, namely that the sum 
of a large number of (not necessarily Poisson) independent arrival processes 
can be approximated by a Poisson process with arrival rate equal to the sum of 
the individual arrival rates. The component processes must have a small rate 
relative to the total (so that none of them imposes its probabilistic character on 
the total arrival process) and they must also satisfy some technical mathematical 
assumptions. Further discussion of this fact is beyond our scope, but we note 
that it is in large measure responsible for the abundance of Poisson-like processes 
in practice. For example, the telephone traffic originating in a city consists of 
many component processes, each of which characterizes the phone calls placed by 
individual residents. The component processes need not be Poisson; some people 
for example tend to make calls in batches, and (usually) while in the process of 
talking, cannot initiate or receive a second call. However, the total telephone 
traffic is well-modeled by a Poisson process. For the same reasons, the process 
of auto accidents in a city, customer arrivals at a store, particle emissions from 
radioactive material, etc., tend to have the character of the Poisson process. 


The Random Incidence Paradox 


The arrivals of a Poisson process partition the time axis into a sequence of 
interarrival intervals; each interarrival interval starts with an arrival and ends at 
the time of the next arrival. We have seen that the lengths of these interarrival 
intervals are independent exponential random variables with parameter $\lambda$ and 
mean $1/\lambda$, where $\lambda$ is the rate of the process. More precisely, for every $k$, the 
length of the $k$th interarrival interval has this exponential distribution. In this 
subsection, we look at these interarrival intervals from a different perspective. 

Let us fix a time instant $t^*$ and consider the length $L$ of the interarrival 
interval to which it belongs. For a concrete context, think of a person who shows 
up at the bus station at some arbitrary time $t^*$ and measures the time from the 
previous bus arrival until the next bus arrival. The arrival of this person is often 
referred to as a “random incidence,” but the reader should be aware that the 
term is misleading: $t^*$ is just a particular time instant, not a random variable. 

We assume that $t^*$ is much larger than the starting time of the Poisson 
process so that we can be fairly certain that there has been an arrival prior 
to time $t^*$. To avoid the issue of determining how large a $t^*$ is large enough, 
we can actually assume that the Poisson process has been running forever, so 
that we can be fully certain that there has been a prior arrival, and that $L$ is 
well-defined. One might superficially argue that $L$ is the length of a “typical” 
interarrival interval, and is exponentially distributed, but this turns out to be 
false. Instead, we will establish that $L$ has an Erlang PDF of order two. 
This is known as the random incidence phenomenon or paradox, and it can 
be explained with the help of Fig. 5.7. Let $[U, V]$ be the interarrival interval to 
which $t^*$ belongs, so that $L = V - U$. In particular, $U$ is the time of the first 
arrival prior to $t^*$ and $V$ is the time of the first arrival after $t^*$. We split $L$ into 
two parts, 

$$L = (t^* - U) + (V - t^*),$$


where $t^* - U$ is the elapsed time since the last arrival, and $V - t^*$ is the remaining 
time until the next arrival. Note that $t^* - U$ is determined by the past history of 
the process (before $t^*$), while $V - t^*$ is determined by the future of the process 
(after time $t^*$). By the independence properties of the Poisson process, the 
random variables $t^* - U$ and $V - t^*$ are independent. By the memorylessness 
property, the Poisson process starts fresh at time $t^*$, and therefore $V - t^*$ is 
exponential with parameter $\lambda$. The random variable $t^* - U$ is also exponential 
with parameter $\lambda$. The easiest way of seeing this is to realize that if we run a 
Poisson process backwards in time it remains Poisson; this is because the defining 
properties of a Poisson process make no reference to whether time moves forward 
or backward. A more formal argument is obtained by noting that 


$$P(t^* - U > x) = P(\text{no arrivals during } [t^* - x, t^*]) = P(0, x) = e^{-\lambda x}, \qquad x \geq 0.$$


We have therefore established that $L$ is the sum of two independent exponential 
random variables with parameter $\lambda$, i.e., Erlang of order two, with mean $2/\lambda$. 


[Diagram: a time axis showing the arrival times $U$ and $V$, the chosen time instant $t^*$ 
between them, the elapsed time $t^* - U$, and the remaining time $V - t^*$.] 


Figure 5.7: Illustration of the random incidence phenomenon. For a fixed time 
instant $t^*$, the corresponding interarrival interval $[U, V]$ consists of the elapsed 
time $t^* - U$ and the remaining time $V - t^*$. These two times are independent 
and are exponentially distributed with parameter $\lambda$, so the PDF of their sum is 
Erlang of order two. 
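
The random incidence phenomenon is easy to reproduce by simulation. The Python sketch below runs a Poisson process from time 0, finds the interarrival interval that covers a fixed instant `t_star`, and averages its length over many independent runs. The function name, the rate, and `t_star = 30` are our own assumptions; the process start at time 0 is treated as an arrival, which matters only with negligible probability when `t_star` is large.

```python
import random

def covering_interval_length(lam: float, t_star: float, rng: random.Random) -> float:
    """Length V - U of the interarrival interval that contains t_star."""
    prev, nxt = 0.0, rng.expovariate(lam)
    while nxt <= t_star:
        # Advance one arrival at a time until the next arrival passes t_star.
        prev, nxt = nxt, nxt + rng.expovariate(lam)
    return nxt - prev

rng = random.Random(6)
lam, t_star, trials = 1.0, 30.0, 20_000
lengths = [covering_interval_length(lam, t_star, rng) for _ in range(trials)]
mean_length = sum(lengths) / trials

print(mean_length)  # close to 2 / lam, not to the typical interarrival mean 1 / lam
```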


Random incidence phenomena are often the source of misconceptions and 
errors, but these can be avoided with careful probabilistic modeling. The key 
issue is that even though interarrival intervals have length $1/\lambda$ on the average, an 
observer who arrives at an arbitrary time is more likely to fall in a large rather 
than a small interarrival interval. As a consequence the expected length seen by 
the observer is higher, $2/\lambda$ in this case. This point is amplified by the example 
that follows. 
Example 5.17. Random incidence in a non-Poisson arrival process. Buses 
arrive at a station deterministically, on the hour, and fifteen minutes after the hour. 
Thus, the interarrival times alternate between 15 and 45 minutes. The average 
interarrival time is 30 minutes. A person shows up at the bus station at a “random” 
time. We interpret “random” to mean a time which is uniformly distributed within 
a particular hour. Such a person falls into an interarrival interval of length 15 with 
probability 1/4, and an interarrival interval of length 45 with probability 3/4. The 
expected value of the length of the chosen interarrival interval is 


$$15 \cdot \frac{1}{4} + 45 \cdot \frac{3}{4} = 37.5,$$


which is considerably larger than 30, the average interarrival time. 
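
The arithmetic of this example is an instance of length-biased sampling: an interval is selected with probability proportional to its length. A small Python sketch (variable names are ours) makes the computation explicit.

```python
lengths = [15.0, 45.0]                  # interarrival times within one hour, in minutes
cycle = sum(lengths)                    # 60 minutes

# A uniformly chosen instant lands in an interval with probability
# proportional to that interval's length.
probs = [length / cycle for length in lengths]

mean_seen = sum(p * length for p, length in zip(probs, lengths))
mean_interarrival = cycle / len(lengths)

print(probs)              # chance of landing in the short and the long interval
print(mean_seen)          # expected length of the interval the person falls into
print(mean_interarrival)  # plain average of the interarrival times
```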
The Bernoulli and Poisson processes studied in the preceding chapter are memoryless, 
in the sense that the future does not depend on the past: the occurrences 
of new “successes” or “arrivals” do not depend on the past history of the process. 
In this chapter, we consider processes where the future depends on and can be 
predicted to some extent by what has happened in the past. 

We emphasize models where the effect of the past on the future is summa- 
rized by a state, which changes over time according to given probabilities. We 
restrict ourselves to models whose state can take a finite number of values and 
can change in discrete instants of time. We want to analyze the probabilistic 
properties of the sequence of state values. 

The range of applications of the models of this chapter is truly vast. It 
includes just about any dynamical system whose evolution over time involves 
uncertainty, provided the state of the system is suitably defined. Such systems 
arise in a broad variety of fields, such as communications, automatic control, 
signal processing, manufacturing, economics, resource allocation, etc. 


DISCRETE-TIME MARKOV CHAINS 


We will first consider discrete-time Markov chains, in which the state changes 
at certain discrete time instants, indexed by an integer variable $n$. At each time 
step $n$, the Markov chain has a state, denoted by $X_n$, which belongs to a finite 
set $S$ of possible states, called the state space. Without loss of generality, and 
unless there is a statement to the contrary, we will assume that $S = \{1, \ldots, m\}$, 
for some positive integer $m$. The Markov chain is described in terms of its 
transition probabilities $p_{ij}$: whenever the state happens to be $i$, there is 
probability $p_{ij}$ that the next state is equal to $j$. Mathematically, 


$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad i, j \in S.$$


The key assumption underlying Markov processes is that the transition probabilities 
$p_{ij}$ apply whenever state $i$ is visited, no matter what happened in the 
past, and no matter how state $i$ was reached. Mathematically, we assume the 
Markov property, which requires that 


$$\begin{aligned}
P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) &= P(X_{n+1} = j \mid X_n = i) \\
&= p_{ij},
\end{aligned}$$


for all times $n$, all states $i, j \in S$, and all possible sequences $i_0, \ldots, i_{n-1}$ of earlier 
states. Thus, the probability law of the next state $X_{n+1}$ depends on the past 
only through the value of the present state $X_n$. 

The transition probabilities $p_{ij}$ must of course be nonnegative, and sum to 
one: 


$$\sum_{j=1}^{m} p_{ij} = 1, \qquad \text{for all } i.$$
We will generally allow the probabilities $p_{ii}$ to be positive, in which case it is 
possible for the next state to be the same as the current one. Even though the 
state does not change, we still view this as a state transition of a special type (a 
“self-transition”). 


Specification of Markov Models 

• A Markov chain model is specified by identifying 
(a) the set of states $S = \{1, \ldots, m\}$, 
(b) the set of possible transitions, namely, those pairs $(i, j)$ for which 
$p_{ij} > 0$, and, 
(c) the numerical values of those $p_{ij}$ that are positive. 

• The Markov chain specified by this model is a sequence of random 
variables $X_0, X_1, X_2, \ldots$, that take values in $S$ and which satisfy 


$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = p_{ij},$$


for all times $n$, all states $i, j \in S$, and all possible sequences $i_0, \ldots, i_{n-1}$ 
of earlier states. 


All of the elements of a Markov chain model can be encoded in a transition 
probability matrix, which is simply a two-dimensional array whose element 
at the $i$th row and $j$th column is $p_{ij}$: 


$$\begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1m} \\
p_{21} & p_{22} & \cdots & p_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
p_{m1} & p_{m2} & \cdots & p_{mm}
\end{bmatrix}$$


It is also helpful to lay out the model in the so-called transition probability 
graph, whose nodes are the states and whose arcs are the possible transitions. 
By recording the numerical values of $p_{ij}$ near the corresponding arcs, one can 
visualize the entire model in a way that can make some of its major properties 
readily apparent. 


Example 6.1. Alice is taking a probability class and in each week she can be 
either up-to-date or she may have fallen behind. If she is up-to-date in a given 
week, the probability that she will be up-to-date (or behind) in the next week is 
0.8 (or 0.2, respectively). If she is behind in the given week, the probability that 
she will be up-to-date (or behind) in the next week is 0.6 (or 0.4, respectively). We 
assume that these probabilities do not depend on whether she was up-to-date or 
behind in previous weeks, so the problem has the typical Markov chain character 
(the future depends on the past only through the present). 
Let us introduce states 1 and 2, and identify them with being up-to-date and 
behind, respectively. Then, the transition probabilities are 


$$p_{11} = 0.8, \quad p_{12} = 0.2, \quad p_{21} = 0.6, \quad p_{22} = 0.4,$$


and the transition probability matrix is 


$$\begin{bmatrix} 0.8 & 0.2 \\ 0.6 & 0.4 \end{bmatrix}.$$


The transition probability graph is shown in Fig. 6.1. 


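[Graph: two nodes, Up-to-Date and Behind. Up-to-Date has a self-loop with probability 0.8 
and an arc to Behind with probability 0.2; Behind has an arc to Up-to-Date with 
probability 0.6 and a self-loop with probability 0.4.] 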


Figure 6.1: The transition probability graph in Example 6.1. 
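
In code, a Markov chain model of this kind is often stored as a list of rows, one per state. The Python sketch below encodes the matrix of Example 6.1 and checks the two requirements on transition probabilities (nonnegative entries, rows summing to one). The function name `is_transition_matrix` and the tolerance are our own illustrative choices.

```python
def is_transition_matrix(P, tol: float = 1e-12) -> bool:
    """True if P is square, has nonnegative entries, and each row sums to one."""
    m = len(P)
    return all(
        len(row) == m
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in P
    )

# Row i holds the probabilities p_i1, p_i2 (state 1 = up-to-date, state 2 = behind).
P_alice = [[0.8, 0.2],
           [0.6, 0.4]]

print(is_transition_matrix(P_alice))
```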


Example 6.2. A fly moves along a straight line in unit increments. At each 
time period, it moves one unit to the left with probability 0.3, one unit to the right 
with probability 0.3, and stays in place with probability 0.4, independently of the 
past history of movements. A spider is lurking at positions 1 and $m$: if the fly 
lands there, it is captured by the spider, and the process terminates. We want to 
construct a Markov chain model, assuming that the fly starts in one of the positions 
$2, \ldots, m - 1$. 

Let us introduce states $1, 2, \ldots, m$, and identify them with the corresponding 
positions of the fly. The nonzero transition probabilities are 


$$p_{11} = 1, \qquad p_{mm} = 1,$$


$$p_{ij} = \begin{cases} 0.3 & \text{if } j = i - 1 \text{ or } j = i + 1, \\ 0.4 & \text{if } j = i, \end{cases} \qquad \text{for } i = 2, \ldots, m - 1.$$


The transition probability graph and matrix are shown in Fig. 6.2. 
Given a Markov chain model, we can compute the probability of any particular 
sequence of future states. This is analogous to the use of the multiplication 
rule in sequential (tree) probability models. In particular, we have 


$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_0 = i_0) \, p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$

Figure 6.2: The transition probability graph and the transition probability matrix 
in Example 6.2, for the case where $m = 4$. 


To verify this property, note that 


$$\begin{aligned}
P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n)
&= P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \, P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \\
&= p_{i_{n-1} i_n} \, P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}),
\end{aligned}$$


where the last equality made use of the Markov property. We then apply the 
same argument to the term $P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1})$ and continue similarly, 
until we eventually obtain the desired expression. If the initial state $X_0$ is given 
and is known to be equal to some $i_0$, a similar argument yields 


$$P(X_1 = i_1, \ldots, X_n = i_n \mid X_0 = i_0) = p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$

Graphically, a state sequence can be identified with a sequence of arcs in the 
transition probability graph, and the probability of such a path (given the initial 
state) is given by the product of the probabilities associated with the arcs 
traversed by the path. 


Example 6.3. For the spider and fly example (Example 6.2), we have 

$$P(X_1 = 2, X_2 = 2, X_3 = 3, X_4 = 4 \mid X_0 = 2) = p_{22} p_{22} p_{23} p_{34} = (0.4)^2 (0.3)^2.$$

We also have 

$$\begin{aligned}
P(X_0 = 2, X_1 = 2, X_2 = 2, X_3 = 3, X_4 = 4) &= P(X_0 = 2) \, p_{22} p_{22} p_{23} p_{34} \\
&= P(X_0 = 2) (0.4)^2 (0.3)^2.
\end{aligned}$$

Note that in order to calculate a probability of this form, in which there is no 
conditioning on a fixed initial state, we need to specify a probability law for the 
initial state $X_0$. 
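The path-probability rule can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `spider_fly_matrix` and `path_probability` are hypothetical, and states are labeled 1 through m as in Example 6.2.

```python
def spider_fly_matrix(m):
    """Transition matrix of Example 6.2; state i is stored at index i - 1."""
    P = [[0.0] * m for _ in range(m)]
    P[0][0] = 1.0            # p_11 = 1 (fly captured at position 1)
    P[m - 1][m - 1] = 1.0    # p_mm = 1 (fly captured at position m)
    for i in range(1, m - 1):
        P[i][i - 1] = 0.3    # one unit to the left
        P[i][i] = 0.4        # stay in place
        P[i][i + 1] = 0.3    # one unit to the right
    return P


def path_probability(P, states, initial_prob=1.0):
    """Product of the arc probabilities along `states` (1-based labels).

    With initial_prob = 1 this is the probability conditioned on the first
    state; otherwise pass P(X_0 = states[0]) as initial_prob.
    """
    prob = initial_prob
    for a, b in zip(states, states[1:]):
        prob *= P[a - 1][b - 1]
    return prob


P = spider_fly_matrix(4)
# Conditional probability from Example 6.3: (0.4)^2 (0.3)^2
print(path_probability(P, [2, 2, 2, 3, 4]))
```

Passing `initial_prob` makes explicit that the unconditional probability requires a probability law for the initial state.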
n-Step Transition Probabilities 

Many Markov chain problems require the calculation of the probability law of 
the state at some future time, conditioned on the current state. This probability 
law is captured by the n-step transition probabilities, defined by 

$$r_{ij}(n) = P(X_n = j \mid X_0 = i).$$

In words, $r_{ij}(n)$ is the probability that the state after $n$ time periods will be $j$, 
given that the current state is $i$. It can be calculated using the following basic 
recursion, known as the Chapman-Kolmogorov equation. 


Chapman-Kolmogorov Equation for the n-Step Transition 
Probabilities 

The n-step transition probabilities can be generated by the recursive formula 

$$r_{ij}(n) = \sum_{k=1}^m r_{ik}(n-1) \, p_{kj}, \qquad \text{for } n > 1, \text{ and all } i, j,$$

starting with 

$$r_{ij}(1) = p_{ij}.$$


To verify the formula, we apply the total probability theorem as follows: 

$$\begin{aligned}
P(X_n = j \mid X_0 = i) &= \sum_{k=1}^m P(X_{n-1} = k \mid X_0 = i) \, P(X_n = j \mid X_{n-1} = k, X_0 = i) \\
&= \sum_{k=1}^m r_{ik}(n-1) \, p_{kj};
\end{aligned}$$

see Fig. 6.3 for an illustration. We have used here the Markov property: once 
we condition on $X_{n-1} = k$, the conditioning on $X_0 = i$ does not affect the 
probability $p_{kj}$ of reaching $j$ at the next step. 
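The recursion can be carried out mechanically. The following is a minimal sketch, assuming the chain is given as a list of rows; the function name `n_step` is hypothetical, and the matrix used is the one of Example 6.1.

```python
def n_step(P, n):
    """Return the matrix of r_ij(n), using r_ij(n) = sum_k r_ik(n-1) p_kj."""
    m = len(P)
    R = [row[:] for row in P]          # r_ij(1) = p_ij
    for _ in range(n - 1):
        R = [[sum(R[i][k] * P[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return R


P = [[0.8, 0.2],
     [0.6, 0.4]]                       # Example 6.1
for n in (1, 2, 5, 20):
    print(n, n_step(P, n))
```

For this chain the rows of the printed matrices become nearly identical as n grows, which is the behavior described for Fig. 6.4.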

We can view $r_{ij}(n)$ as the element at the $i$th row and $j$th column of a two- 
dimensional array, called the n-step transition probability matrix.† Figures 
6.4 and 6.5 give the n-step transition probabilities $r_{ij}(n)$ for the cases of 
Examples 6.1 and 6.2, respectively. There are some interesting observations about 
the limiting behavior of $r_{ij}(n)$ in these two examples. In Fig. 6.4, we see that 
each $r_{ij}(n)$ converges to a limit, as $n \to \infty$, and this limit does not depend on 
the initial state. Thus, each state has a positive “steady-state” probability of 
being occupied at times far into the future. Furthermore, the probability $r_{ij}(n)$ 
depends on the initial state $i$ when $n$ is small, but over time this dependence 
diminishes. Many (but by no means all) probabilistic models that evolve over 
time have such a character: after a sufficiently long time, the effect of their initial 
condition becomes negligible. 

† Those readers familiar with matrix multiplication may recognize that the 
Chapman-Kolmogorov equation can be expressed as follows: the matrix of n-step 
transition probabilities $r_{ij}(n)$ is obtained by multiplying the matrix of $(n-1)$-step 
transition probabilities $r_{ik}(n-1)$ with the one-step transition probability matrix. 
Thus, the n-step transition probability matrix is the $n$th power of the transition 
probability matrix. 

[Diagram with columns labeled Time 0, Time n-1, and Time n.] 

Figure 6.3: Derivation of the Chapman-Kolmogorov equation. The probability 
of being at state $j$ at time $n$ is the sum of the probabilities $r_{ik}(n-1) p_{kj}$ of the 
different ways of reaching $j$. 

[Plots of the n-step transition probabilities as a function of the number $n$ of 
transitions, and the sequence of n-step transition probability matrices.] 

Figure 6.4: n-step transition probabilities for the “up-to-date/behind” Example 
6.1. Observe that as $n \to \infty$, $r_{ij}(n)$ converges to a limit that does not depend on 
the initial state. 

In Fig. 6.5, we see a qualitatively different behavior: $r_{ij}(n)$ again converges, 
but the limit depends on the initial state, and can be zero for selected states. 
Here, we have two states that are “absorbing,” in the sense that they are infinitely 
repeated, once reached. These are the states 1 and 4 that correspond to the 
capture of the fly by one of the two spiders. Given enough time, it is certain 
that some absorbing state will be reached. Accordingly, the probability of being 
at the non-absorbing states 2 and 3 diminishes to zero as time increases. 
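The dependence of the limit on the initial state can be checked numerically. The sketch below, which is only illustrative, multiplies the transition matrix of Example 6.2 (with m = 4) by itself many times; the helper `mat_mul` and the choice of 200 multiplications are assumptions made for the illustration.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Spiders-and-fly chain of Example 6.2 with m = 4 (states 1 and 4 absorbing).
P = [[1.0, 0.0, 0.0, 0.0],
     [0.3, 0.4, 0.3, 0.0],
     [0.0, 0.3, 0.4, 0.3],
     [0.0, 0.0, 0.0, 1.0]]

R = P
for _ in range(200):          # after the loop, R holds the 201-step matrix
    R = mat_mul(R, P)

for i, row in enumerate(R, start=1):
    print(i, [round(x, 4) for x in row])
# Rows 2 and 3 differ: starting closer to a spider makes capture there more likely,
# and the columns of the non-absorbing states 2 and 3 are essentially zero.
```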


[Plots of the n-step transition probabilities and the sequence of transition 
probability matrices.] 

Figure 6.5: n-step transition probabilities for the “spiders-and-fly” Example 6.2. 
Observe that $r_{ij}(n)$ converges to a limit that depends on the initial state. 


These examples illustrate that there is a variety of types of states and 
asymptotic occupancy behavior in Markov chains. We are thus motivated to 
classify and analyze the various possibilities, and this is the subject of the next 
three sections. 


6.2 CLASSIFICATION OF STATES 


In the preceding section, we saw through examples several types of Markov chain 
states with qualitatively different characteristics. In particular, some states, after 
being visited once, are certain to be revisited again, while for some other states 
this may not be the case. In this section, we focus on the mechanism by which 
this occurs. In particular, we wish to classify the states of a Markov chain with 
a focus on the long-term frequency with which they are visited. 

As a first step, we make the notion of revisiting a state precise. Let us say 
that a state $j$ is accessible from a state $i$ if for some $n$, the n-step transition 
probability $r_{ij}(n)$ is positive, i.e., if there is positive probability of reaching $j$, 
starting from $i$, after some number of time periods. An equivalent definition is 
that there is a possible state sequence $i, i_1, \ldots, i_{n-1}, j$, that starts at $i$ and ends 
at $j$, in which the transitions $(i, i_1), (i_1, i_2), \ldots, (i_{n-2}, i_{n-1}), (i_{n-1}, j)$ all have 
positive probability. Let $A(i)$ be the set of states that are accessible from $i$. We 
say that $i$ is recurrent if for every $j$ that is accessible from $i$, $i$ is also accessible 
from $j$; that is, for all $j$ that belong to $A(i)$ we have that $i$ belongs to $A(j)$. 

When we start at a recurrent state $i$, we can only visit states $j \in A(i)$ 
from which $i$ is accessible. Thus, from any future state, there is always some 
probability of returning to $i$ and, given enough time, this is certain to happen. 
By repeating this argument, if a recurrent state is visited once, it will be revisited 
an infinite number of times. 

A state is called transient if it is not recurrent. In particular, if $i$ is transient, 
there are states $j \in A(i)$ such that $i$ is not accessible from $j$. After each visit to 
state $i$, there is positive probability that the state enters such a $j$. Given enough 
time, this will happen, and state $i$ cannot be visited after that. Thus, a transient 
state will only be visited a finite number of times. 

Note that transience or recurrence is determined by the arcs of the transition 
probability graph [those pairs $(i, j)$ for which $p_{ij} > 0$] and not by the 
numerical values of the $p_{ij}$. Figure 6.6 provides an example of a transition 
probability graph, and the corresponding recurrent and transient states. 
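Because only the arcs matter, the classification reduces to a reachability computation on the graph. The following sketch is one possible way to do it, under the assumption that accessibility means reachability in one or more transitions; the function names are hypothetical, and the chain used is the spiders-and-fly chain of Example 6.2 with m = 4 (0-based indices in the code).

```python
def accessible(P, i):
    """Return A(i): indices j reachable from i along arcs with p > 0."""
    reached, frontier = set(), [i]
    while frontier:
        k = frontier.pop()
        for j in range(len(P)):
            if P[k][j] > 0 and j not in reached:
                reached.add(j)
                frontier.append(j)
    return reached


def classify(P):
    """Label state i recurrent if i is in A(j) for every j in A(i)."""
    A = [accessible(P, i) for i in range(len(P))]
    return ["recurrent" if all(i in A[j] for j in A[i]) else "transient"
            for i in range(len(P))]


P = [[1.0, 0.0, 0.0, 0.0],
     [0.3, 0.4, 0.3, 0.0],
     [0.0, 0.3, 0.4, 0.3],
     [0.0, 0.0, 0.0, 1.0]]
print(classify(P))   # states 1 and 4 recurrent, states 2 and 3 transient
```

Replacing the numerical entries by any other positive values leaves the output unchanged, in line with the remark that only the arcs matter.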


[Graph with four states, labeled from left to right: Recurrent, Transient, 
Recurrent, Recurrent.] 


Figure 6.6: Classification of states given the transition probability graph. Start- 
ing from state 1, the only accessible state is itself, and so 1 is a recurrent state. 
States 1, 3, and 4 are accessible from 2, but 2 is not accessible from any of them, 
so state 2 is transient. States 3 and 4 are accessible only from each other (and 
themselves), and they are both recurrent. 


If $i$ is a recurrent state, the set of states $A(i)$ that are accessible from $i$ 
form a recurrent class (or simply class), meaning that states in $A(i)$ are all 
accessible from each other, and no state outside $A(i)$ is accessible from them. 
Mathematically, for a recurrent state $i$, we have $A(i) = A(j)$ for all $j$ that belong 
to $A(i)$, as can be seen from the definition of recurrence. For example, in the 
graph of Fig. 6.6, states 3 and 4 form a class, and state 1 by itself also forms a 
class. 

It can be seen that at least one recurrent state must be accessible from any 
given transient state. This is intuitively evident, and a more precise justification 
is given in the theoretical problems section. It follows that there must exist 
at least one recurrent state, and hence at least one class. Thus, we reach the 
following conclusion. 


Markov Chain Decomposition 

• A Markov chain can be decomposed into one or more recurrent classes, 
  plus possibly some transient states. 

• A recurrent state is accessible from all states in its class, but is not 
  accessible from recurrent states in other classes. 

• A transient state is not accessible from any recurrent state. 

• At least one, possibly more, recurrent states are accessible from a given 
  transient state. 


Figure 6.7 provides examples of Markov chain decompositions. Decompo- 
sition provides a powerful conceptual tool for reasoning about Markov chains 
and visualizing the evolution of their state. In particular, we see that: 


(a) once the state enters (or starts in) a class of recurrent states, it stays within 
that class; since all states in the class are accessible from each other, all 
states in the class will be visited an infinite number of times; 


(b) if the initial state is transient, then the state trajectory contains an ini- 
tial portion consisting of transient states and a final portion consisting of 
recurrent states from the same class. 


For the purpose of understanding long-term behavior of Markov chains, it is im- 
portant to analyze chains that consist of a single recurrent class. For the purpose 
of understanding short-term behavior, it is also important to analyze the mech- 
anism by which any particular class of recurrent states is entered starting from a 
given transient state. These two issues, long-term and short-term behavior, are 
the focus of Sections 6.3 and 6.4, respectively. 


Periodicity 


One more characterization of a recurrent class is of special interest, and relates 
to the presence or absence of a certain periodic pattern in the times that a state 
is visited. In particular, a recurrent class is said to be periodic if its states can 
be grouped in $d > 1$ disjoint subsets $S_1, \ldots, S_d$ so that all transitions from one 
subset lead to the next subset; see Fig. 6.8. More precisely, 

$$\text{if } i \in S_k \text{ and } p_{ij} > 0, \text{ then } \begin{cases} j \in S_{k+1}, & \text{if } k = 1, \ldots, d-1, \\ j \in S_1, & \text{if } k = d. \end{cases}$$

A recurrent class that is not periodic is said to be aperiodic. 

[Three transition probability graphs: a single class of recurrent states; a single 
class of recurrent states (1 and 2) and one transient state (3); two classes of 
recurrent states (the class of state 1 and the class of states 4 and 5) and two 
transient states (2 and 3).] 

Figure 6.7: Examples of Markov chain decompositions into recurrent classes and 
transient states. 

Thus, in a periodic recurrent class, we move through the sequence of subsets 
in order, and after d steps, we end up in the same subset. As an example, the 
recurrent class in the second chain of Fig. 6.7 (states 1 and 2) is periodic, and 
the same is true of the class consisting of states 4 and 5 in the third chain of Fig. 
6.7. All other classes in the chains of this figure are aperiodic. 
Figure 6.8: Structure of a periodic recurrent class. 


Note that given a periodic recurrent class, a positive time $n$, and a state $j$ in 
the class, there must exist some state $i$ such that $r_{ij}(n) = 0$. The reason is that, 
from the definition of periodicity, the states are grouped in subsets $S_1, \ldots, S_d$, 
and the subset to which $j$ belongs can be reached at time $n$ from the states in 
only one of the subsets. Thus, a way to verify aperiodicity of a given recurrent 
class $R$ is to check whether there is a special time $\bar{n} \geq 1$ and a special state 
$s \in R$ that can be reached at time $\bar{n}$ from all initial states in $R$, i.e., $r_{is}(\bar{n}) > 0$ 
for all $i \in R$. As an example, consider the first chain in Fig. 6.7. State $s = 2$ 
can be reached at time $\bar{n} = 2$ starting from every state, so the unique recurrent 
class of that chain is aperiodic. 

A converse statement, which we do not prove, also turns out to be true: 
if a recurrent class is not periodic, then a time $\bar{n}$ and a special state $s$ with the 
above properties can always be found. 


Periodicity 

Consider a recurrent class $R$. 

• The class is called periodic if its states can be grouped in $d > 1$ 
  disjoint subsets $S_1, \ldots, S_d$, so that all transitions from $S_k$ lead to $S_{k+1}$ 
  (or to $S_1$ if $k = d$). 

• The class is aperiodic (not periodic) if and only if there exists a time 
  $\bar{n}$ and a state $s$ in the class, such that $r_{is}(\bar{n}) > 0$ for all $i \in R$. 
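The aperiodicity criterion suggests a direct search for a time and a state with the stated property. The sketch below is illustrative only: the function name `is_aperiodic` is hypothetical, and the default search bound `len(R) ** 2` is an assumption (it is motivated by the classical bound $(m-1)^2 + 1$ on the exponent of a primitive $m \times m$ matrix, which does not exceed $m^2$). Only the pattern of positive entries is tracked, since positivity of $r_{is}(n)$ depends only on the arcs.

```python
def is_aperiodic(P, R, max_n=None):
    """Search for a time n and a state s in R with r_is(n) > 0 for all i in R.

    P: transition matrix (list of rows); R: indices of one recurrent class.
    Returns (True, n, s) if found within max_n steps, else (False, None, None).
    """
    R = list(R)
    size = len(R)
    if max_n is None:
        max_n = size ** 2                    # assumed search bound
    step = [[P[i][j] > 0 for j in R] for i in R]
    reach = [row[:] for row in step]         # positivity pattern of r_ij(1)
    for n in range(1, max_n + 1):
        for s in range(size):
            if all(reach[i][s] for i in range(size)):
                return True, n, R[s]
        # positivity pattern of r_ij(n + 1)
        reach = [[any(reach[i][k] and step[k][j] for k in range(size))
                  for j in range(size)] for i in range(size)]
    return False, None, None


print(is_aperiodic([[0.8, 0.2], [0.6, 0.4]], [0, 1]))   # chain of Example 6.1
print(is_aperiodic([[0.0, 1.0], [1.0, 0.0]], [0, 1]))   # two states that alternate
```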


6.3 STEADY-STATE BEHAVIOR 


In Markov chain models, we are often interested in long-term state occupancy 
behavior, that is, in the n-step transition probabilities $r_{ij}(n)$ when $n$ is very 
large. We have seen in the example of Fig. 6.4 that the $r_{ij}(n)$ may converge to 
steady-state values that are independent of the initial state, so to what extent 
is this behavior typical? 

If there are two or more classes of recurrent states, it is clear that the 
limiting values of the $r_{ij}(n)$ must depend on the initial state (visiting $j$ far into 
the future will depend on whether $j$ is in the same class as the initial state $i$). 
We will, therefore, restrict attention to chains involving a single recurrent class, 
plus possibly some transient states. This is not as restrictive as it may seem, 
since we know that once the state enters a particular recurrent class, it will stay 
within that class. Thus, asymptotically, the presence of all classes except for one 
is immaterial. 

Even for chains with a single recurrent class, the $r_{ij}(n)$ may fail to converge. 
To see this, consider a recurrent class with two states, 1 and 2, such that from 
state 1 we can only go to 2, and from 2 we can only go to 1 ($p_{12} = p_{21} = 1$). 
Then, starting at some state, we will be in that same state after any even number 
of transitions, and in the other state after any odd number of transitions. What 
is happening here is that the recurrent class is periodic, and for such a class, it 
can be seen that the $r_{ij}(n)$ generically oscillate. 

We now assert that for every state $j$, the n-step transition probabilities 
$r_{ij}(n)$ approach a limiting value that is independent of $i$, provided we exclude 
the two situations discussed above (multiple recurrent classes and/or a periodic 
class). This limiting value, denoted by $\pi_j$, has the interpretation 

$$\pi_j \approx P(X_n = j), \qquad \text{when } n \text{ is large},$$

and is called the steady-state probability of $j$. The following is an important 
theorem. Its proof is quite complicated and is outlined together with several 
other proofs in the theoretical problems section. 


Steady-State Convergence Theorem 

Consider a Markov chain with a single recurrent class, which is aperiodic. 
Then, the states $j$ are associated with steady-state probabilities $\pi_j$ that 
have the following properties. 

(a) $\lim_{n \to \infty} r_{ij}(n) = \pi_j$, for all $i, j$. 

(b) The $\pi_j$ are the unique solution of the system of equations below: 

$$\pi_j = \sum_{k=1}^m \pi_k p_{kj}, \qquad j = 1, \ldots, m,$$

$$1 = \sum_{k=1}^m \pi_k.$$

(c) We have 

$$\pi_j = 0, \qquad \text{for all transient states } j,$$

$$\pi_j > 0, \qquad \text{for all recurrent states } j.$$


Since the steady-state probabilities $\pi_j$ sum to 1, they form a probability 
distribution on the state space, called the stationary distribution of the chain. 
The reason for the name is that if the initial state is chosen according to this 
distribution, i.e., if 

$$P(X_0 = j) = \pi_j, \qquad j = 1, \ldots, m,$$

then, using the total probability theorem, we have 

$$P(X_1 = j) = \sum_{k=1}^m P(X_0 = k) \, p_{kj} = \sum_{k=1}^m \pi_k p_{kj} = \pi_j,$$

where the last equality follows from part (b) of the steady-state convergence 
theorem. Similarly, we obtain $P(X_n = j) = \pi_j$, for all $n$ and $j$. Thus, if the 
initial state is chosen according to the stationary distribution, all subsequent 
states will have the same distribution. 

The equations 

$$\pi_j = \sum_{k=1}^m \pi_k p_{kj}, \qquad j = 1, \ldots, m,$$

are called the balance equations. They are a simple consequence of part (a) 
of the theorem and the Chapman-Kolmogorov equation. Indeed, once the 
convergence of $r_{ij}(n)$ to some $\pi_j$ is taken for granted, we can consider the equation 

$$r_{ij}(n) = \sum_{k=1}^m r_{ik}(n-1) \, p_{kj},$$

take the limit of both sides as $n \to \infty$, and recover the balance equations.† The 
balance equations are a linear system of equations that, together with 
$\sum_{k=1}^m \pi_k = 1$, can be solved to obtain the $\pi_j$. The following examples illustrate 
the solution process. 


Example 6.4. Consider a two-state Markov chain with transition probabilities 

$$p_{11} = 0.8, \qquad p_{12} = 0.2,$$

$$p_{21} = 0.6, \qquad p_{22} = 0.4.$$

[This is the same as the chain of Example 6.1 (cf. Fig. 6.1).] The balance equations 
take the form 

$$\pi_1 = \pi_1 p_{11} + \pi_2 p_{21}, \qquad \pi_2 = \pi_1 p_{12} + \pi_2 p_{22},$$

or 

$$\pi_1 = 0.8 \cdot \pi_1 + 0.6 \cdot \pi_2, \qquad \pi_2 = 0.2 \cdot \pi_1 + 0.4 \cdot \pi_2.$$

Note that the above two equations are dependent, since they are both equivalent 
to 

$$\pi_1 = 3 \pi_2.$$

This is a generic property, and in fact it can be shown that one of the balance 
equations depends on the remaining equations (see the theoretical problems). However, 
we know that the $\pi_j$ satisfy the normalization equation 

$$\pi_1 + \pi_2 = 1,$$

which supplements the balance equations and suffices to determine the $\pi_j$ uniquely. 
Indeed, by substituting the equation $\pi_1 = 3\pi_2$ into the equation $\pi_1 + \pi_2 = 1$, we 
obtain $3\pi_2 + \pi_2 = 1$, or 

$$\pi_2 = 0.25,$$

which, using the equation $\pi_1 + \pi_2 = 1$, yields 

$$\pi_1 = 0.75.$$

This is consistent with what we found earlier by iterating the Chapman-Kolmogorov 
equation (cf. Fig. 6.4). 
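The same solution process can be automated for any chain to which the steady-state convergence theorem applies. The following is a minimal sketch, not a definitive implementation: the function name `steady_state` is hypothetical, exact arithmetic with `fractions.Fraction` is a design choice made here to avoid rounding, and one (redundant) balance equation is replaced by the normalization equation, as in the example above.

```python
from fractions import Fraction


def steady_state(P):
    """Solve pi_j = sum_k pi_k p_kj (j = 1..m-1) together with sum_k pi_k = 1.

    Assumes a single recurrent class that is aperiodic, so that the
    solution is unique. Entries of P should be Fractions or integers.
    """
    m = len(P)
    # Balance equations for the first m - 1 states, written as
    # sum_k pi_k (p_kj - [k == j]) = 0, plus an augmented right-hand side.
    A = [[Fraction(P[k][j]) - (1 if k == j else 0) for k in range(m)]
         + [Fraction(0)] for j in range(m - 1)]
    A.append([Fraction(1)] * m + [Fraction(1)])      # normalization equation
    for col in range(m):                             # Gauss-Jordan elimination
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[j][m] for j in range(m)]


P = [[Fraction(4, 5), Fraction(1, 5)],
     [Fraction(3, 5), Fraction(2, 5)]]               # Example 6.4
print(steady_state(P))                               # pi_1 = 3/4, pi_2 = 1/4
```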


† According to a famous and important theorem from linear algebra (called the 
Perron-Frobenius theorem), the balance equations always have a nonnegative solution, 
for any Markov chain. What is special about a chain that has a single recurrent class, 
which is aperiodic, is that the solution is unique and is also equal to the limit of the 
n-step transition probabilities $r_{ij}(n)$. 

Example 6.5. An absent-minded professor has two umbrellas that she uses when 
commuting from home to office and back. If it rains and an umbrella is available in 
her location, she takes it. If it is not raining, she always forgets to take an umbrella. 
Suppose that it rains with probability $p$ each time she commutes, independently of 
other times. What is the steady-state probability that she gets wet on a given day? 

We model this problem using a Markov chain with the following states: 

State $i$: $i$ umbrellas are available in her current location, $i = 0, 1, 2$. 

The transition probability graph is given in Fig. 6.9, and the transition probability 
matrix is 

$$\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1-p & p \\ 1-p & p & 0 \end{bmatrix}.$$

The chain has a single recurrent class that is aperiodic (assuming $0 < p < 1$), so 
the steady-state convergence theorem applies. The balance equations are 

$$\pi_0 = (1-p)\pi_2, \qquad \pi_1 = (1-p)\pi_1 + p\pi_2, \qquad \pi_2 = \pi_0 + p\pi_1.$$

From the second equation, we obtain $\pi_1 = \pi_2$, which together with the first equation 
$\pi_0 = (1-p)\pi_2$ and the normalization equation $\pi_0 + \pi_1 + \pi_2 = 1$, yields 

$$\pi_0 = \frac{1-p}{3-p}, \qquad \pi_1 = \frac{1}{3-p}, \qquad \pi_2 = \frac{1}{3-p}.$$

According to the steady-state convergence theorem, the steady-state probability 
that the professor finds herself in a place without an umbrella is $\pi_0$. The steady- 
state probability that she gets wet is $\pi_0$ times the probability of rain $p$. 
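As a numerical cross-check, one can iterate the distribution of the state and compare with the balance equations. From $\pi_1 = \pi_2$ and $\pi_0 = (1-p)\pi_2$, the normalization equation gives $(3-p)\pi_2 = 1$, so $\pi_0 = (1-p)/(3-p)$. The sketch below assumes $p = 0.5$ and 2000 iterations purely for illustration; the function names are hypothetical.

```python
def umbrella_chain(p):
    """Transition matrix of Example 6.5; index i = umbrellas at her location."""
    return [[0.0, 0.0, 1.0],
            [0.0, 1.0 - p, p],
            [1.0 - p, p, 0.0]]


def iterate_distribution(P, start, steps):
    """Apply pi <- pi P repeatedly, starting from the distribution `start`."""
    m = len(P)
    pi = start[:]
    for _ in range(steps):
        pi = [sum(pi[k] * P[k][j] for k in range(m)) for j in range(m)]
    return pi


p = 0.5
pi = iterate_distribution(umbrella_chain(p), [1.0, 0.0, 0.0], 2000)
wet = p * pi[0]                      # no umbrella available, and it rains
print(pi, wet, (1 - p) / (3 - p))    # pi[0] should be close to (1-p)/(3-p)
```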


[Graph with states No umbrellas, Two umbrellas, and One umbrella; the arcs are 
labeled 1, $p$, and $1-p$.] 


Figure 6.9: Transition probability graph for Example 6.5. 


Example 6.6. A superstitious professor works in a circular building with $m$ 
doors, where $m$ is odd, and never uses the same door twice in a row. Instead 
he uses with probability $p$ (or probability $1-p$) the door that is adjacent in the 
clockwise direction (or the counterclockwise direction, respectively) to the door he 
used last. What is the probability that a given door will be used on some particular 
day far into the future? 

We introduce a Markov chain with the following $m$ states: 

State $i$: Last door used is door $i$, $i = 1, \ldots, m$. 

The transition probability graph of the chain is given in Fig. 6.10, for the case 
$m = 5$. The transition probability matrix is 

$$\begin{bmatrix}
0 & p & 0 & 0 & \cdots & 0 & 1-p \\
1-p & 0 & p & 0 & \cdots & 0 & 0 \\
0 & 1-p & 0 & p & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
p & 0 & 0 & 0 & \cdots & 1-p & 0
\end{bmatrix}.$$

Figure 6.10: Transition probability graph in Example 6.6, for the case of 
$m = 5$ doors. 

Assuming that $0 < p < 1$, the chain has a single recurrent class that is aperiodic. 
[To verify aperiodicity, argue by contradiction: if the class were periodic, there 
could be only two subsets of states such that transitions from one subset lead to 
the other, since it is possible to return to the starting state in two transitions. Thus, 
it cannot be possible to reach a state $i$ from a state $j$ in both an odd and an even 
number of transitions. However, if $m$ is odd, this is true for states 1 and $m$, which 
is a contradiction (for example, see the case where $m = 5$ in Fig. 6.10: doors 1 and 5 
can be reached from each other in 1 transition and also in 4 transitions).] The balance 
equations are 

$$\pi_1 = (1-p)\pi_2 + p\pi_m,$$

$$\pi_i = p\pi_{i-1} + (1-p)\pi_{i+1}, \qquad i = 2, \ldots, m-1,$$

$$\pi_m = (1-p)\pi_1 + p\pi_{m-1}.$$

These equations are easily solved once we observe that by symmetry, all doors 
should have the same steady-state probability. This suggests the solution 

$$\pi_j = \frac{1}{m}, \qquad j = 1, \ldots, m.$$

Indeed, we see that these $\pi_j$ satisfy the balance equations as well as the 
normalization equation, so they must be the desired steady-state probabilities (by the 
uniqueness part of the steady-state convergence theorem). 

Note that if either $p = 0$ or $p = 1$, the chain still has a single recurrent 
class but is periodic. In this case, the n-step transition probabilities $r_{ij}(n)$ do not 
converge to a limit, because the doors are used in a cyclic order. Similarly, if $m$ is 
even, the recurrent class of the chain is periodic, since the states can be grouped 
into two subsets, the even and the odd numbered states, such that from each subset 
one can only go to the other subset. 
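The symmetry argument can be checked exactly for particular values of m and p. The sketch below builds the door chain and tests whether a candidate distribution satisfies every balance equation; the function names, the choice $m = 5$, and the value $p = 1/3$ are assumptions made for illustration (the construction assumes $m \geq 3$).

```python
from fractions import Fraction


def door_matrix(m, p):
    """Door chain of Example 6.6; door i is stored at index i - 1 (m >= 3)."""
    P = [[Fraction(0)] * m for _ in range(m)]
    for i in range(m):
        P[i][(i + 1) % m] = p          # adjacent door, clockwise
        P[i][(i - 1) % m] = 1 - p      # adjacent door, counterclockwise
    return P


def satisfies_balance(pi, P):
    """Check pi_j = sum_k pi_k p_kj for every state j."""
    m = len(P)
    return all(pi[j] == sum(pi[k] * P[k][j] for k in range(m))
               for j in range(m))


m, p = 5, Fraction(1, 3)
P = door_matrix(m, p)
uniform = [Fraction(1, m)] * m
print(satisfies_balance(uniform, P))   # the uniform distribution is stationary
```

The check succeeds for any p because each column of the matrix sums to $p + (1 - p) = 1$.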


Example 6.7. A machine can be either working or broken down on a given day. 
If it is working, it will break down in the next day with probability $b$, and will 
continue working with probability $1-b$. If it breaks down on a given day, it will 
be repaired and be working in the next day with probability $r$, and will continue to 
be broken down with probability $1-r$. What is the steady-state probability that 
the machine is working on a given day? 

We introduce a Markov chain with the following two states: 

State 1: Machine is working, State 2: Machine is broken down. 

The transition probability graph of the chain is given in Fig. 6.11. The transition 
probability matrix is 

$$\begin{bmatrix} 1-b & b \\ r & 1-r \end{bmatrix}.$$

This Markov chain has a single recurrent class that is aperiodic (assuming $0 < b < 1$ 
and $0 < r < 1$), and from the balance equations, we obtain 

$$\pi_1 = (1-b)\pi_1 + r\pi_2, \qquad \pi_2 = b\pi_1 + (1-r)\pi_2,$$

or 

$$b\pi_1 = r\pi_2.$$

This equation together with the normalization equation $\pi_1 + \pi_2 = 1$, yields the 
steady-state probabilities 

$$\pi_1 = \frac{r}{b+r}, \qquad \pi_2 = \frac{b}{b+r}.$$

[Graph with states Working and Broken.] 

Figure 6.11: Transition probability graph for Example 6.7. 
The situation considered in the previous example evidently has the Markov 
property, i.e., the state of the machine at the next day depends explicitly only on 
its state at the present day. However, it is possible to use a Markov chain model 
even if there is a dependence on the states at several past days. The general 
idea is to introduce some additional states which encode what has happened in 
preceding periods. Here is an illustration of this technique. 


Example 6.8. Consider a variation of Example 6.7. If the machine remains 
broken for a given number of $\ell$ days, despite the repair efforts, it is replaced by a 
new working machine. To model this as a Markov chain, we replace the single state 
2, corresponding to a broken down machine, with several states that indicate the 
number of days that the machine is broken. These states are 

State $(2, i)$: The machine has been broken for $i$ days, $i = 1, 2, \ldots, \ell$. 

The transition probability graph is given in Fig. 6.12 for the case where $\ell = 4$. 
Again this Markov chain has a single recurrent class that is aperiodic. From the 
balance equations, we have 

$$\pi_1 = (1-b)\pi_1 + r\big(\pi_{(2,1)} + \cdots + \pi_{(2,\ell-1)}\big) + \pi_{(2,\ell)},$$

$$\pi_{(2,1)} = b\pi_1,$$

$$\pi_{(2,i)} = (1-r)\pi_{(2,i-1)}, \qquad i = 2, \ldots, \ell.$$

The last two equations can be used to express $\pi_{(2,i)}$ in terms of $\pi_1$, 

$$\pi_{(2,i)} = (1-r)^{i-1} b \pi_1, \qquad i = 1, \ldots, \ell.$$

Substituting into the normalization equation $\pi_1 + \sum_{i=1}^{\ell} \pi_{(2,i)} = 1$, we obtain 

$$1 = \left(1 + b \sum_{i=1}^{\ell} (1-r)^{i-1}\right) \pi_1 = \left(1 + \frac{b\big(1 - (1-r)^{\ell}\big)}{r}\right) \pi_1,$$

or 

$$\pi_1 = \frac{r}{r + b\big(1 - (1-r)^{\ell}\big)}.$$

Using the equation $\pi_{(2,i)} = (1-r)^{i-1} b \pi_1$, we can also obtain explicit formulas for 
the $\pi_{(2,i)}$. 

[Graph with the state Working and a chain of Broken states.] 

Figure 6.12: Transition probability graph for Example 6.8. A machine that has 
remained broken for $\ell = 4$ days is replaced by a new, working machine. 
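The closed-form solution of this example can be verified exactly against the chain it came from. In the sketch below, which is illustrative only, index 0 stands for the working state and index i for state (2, i); the function names and the parameter values $b = 1/10$, $r = 1/2$, $\ell = 4$ are assumptions. The closed form used is $\pi_1 = r / \big(r + b(1 - (1-r)^{\ell})\big)$ with $\pi_{(2,i)} = (1-r)^{i-1} b \pi_1$.

```python
from fractions import Fraction


def machine_chain(b, r, ell):
    """Chain of Example 6.8: index 0 = working, index i = state (2, i)."""
    n = ell + 1
    P = [[Fraction(0)] * n for _ in range(n)]
    P[0][0], P[0][1] = 1 - b, b            # keep working / break down
    for i in range(1, ell):
        P[i][0], P[i][i + 1] = r, 1 - r    # repaired / broken one more day
    P[ell][0] = Fraction(1)                # replaced by a new machine
    return P


def closed_form(b, r, ell):
    """Steady-state probabilities from the formulas of the example."""
    pi1 = r / (r + b * (1 - (1 - r) ** ell))
    return [pi1] + [(1 - r) ** (i - 1) * b * pi1 for i in range(1, ell + 1)]


b, r, ell = Fraction(1, 10), Fraction(1, 2), 4
P = machine_chain(b, r, ell)
pi = closed_form(b, r, ell)
balanced = all(pi[j] == sum(pi[k] * P[k][j] for k in range(ell + 1))
               for j in range(ell + 1))
print(pi, sum(pi), balanced)
```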
Long-Term Frequency Interpretations 


Probabilities are often interpreted as relative frequencies in an infinitely long 
string of independent trials. The steady-state probabilities of a Markov chain 
admit a similar interpretation, despite the absence of independence. 

Consider, for example, a Markov chain involving a machine, which at the 
end of any day can be in one of two states, working or broken-down. Each time it 
breaks down, it is immediately repaired at a cost of $1. How are we to model the 
long-term expected cost of repair per day? One possibility is to view it as the 
expected value of the repair cost on a randomly chosen day far into the future; 
this is just the steady-state probability of the broken down state. Alternatively, 
we can calculate the total expected repair cost in n days, where n is very large, 
and divide it by n. Intuition suggests that these two methods of calculation 
should give the same result. Theory supports this intuition, and in general we 
have the following interpretation of steady-state probabilities (a justification is 
given in the theoretical problems section). 


Steady-State Probabilities as Expected State Frequencies 

For a Markov chain with a single class that is aperiodic, the steady-state 
probabilities $\pi_j$ satisfy 

$$\pi_j = \lim_{n \to \infty} \frac{v_{ij}(n)}{n},$$

where $v_{ij}(n)$ is the expected value of the number of visits to state $j$ within 
the first $n$ transitions, starting from state $i$. 
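The expected number of visits can be computed without simulation: by linearity of expectation, the expected number of visits to state j in the first n transitions is the sum of the probabilities $r_{ij}(t)$ for $t = 1, \ldots, n$. The sketch below uses this observation for the chain of Example 6.1; the function name `visit_frequencies` and the horizon $n = 10000$ are assumptions made for illustration.

```python
def visit_frequencies(P, i, n):
    """Return v_ij(n) / n for all j, where v_ij(n) = sum over t = 1..n of r_ij(t)."""
    m = len(P)
    row = P[i][:]                  # r_ij(1) for the given starting index i
    visits = row[:]
    for _ in range(n - 1):
        # Chapman-Kolmogorov step: r_ij(t) from r_ik(t - 1)
        row = [sum(row[k] * P[k][j] for k in range(m)) for j in range(m)]
        visits = [v + x for v, x in zip(visits, row)]
    return [v / n for v in visits]


P = [[0.8, 0.2],
     [0.6, 0.4]]                   # Example 6.1
print(visit_frequencies(P, 0, 10000))
print(visit_frequencies(P, 1, 10000))   # both close to (0.75, 0.25)
```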


Based on this interpretation, $\pi_j$ is the long-term expected fraction of time 
that the state is equal to $j$. Each time that state $j$ is visited, there is probability 
$p_{jk}$ that the next transition takes us to state $k$. We conclude that $\pi_j p_{jk}$ can 
be viewed as the long-term expected fraction of transitions that move the state 
from $j$ to $k$.† 

† In fact, some stronger statements are also true. Namely, whenever we carry 
out the probabilistic experiment and generate a trajectory of the Markov chain over 
an infinite time horizon, the observed long-term frequency with which state $j$ is visited 
will be exactly equal to $\pi_j$, and the observed long-term frequency of transitions from 
$j$ to $k$ will be exactly equal to $\pi_j p_{jk}$. Even though the trajectory is random, these 
equalities hold with certainty, that is, with probability 1. The exact meaning of this 
statement will become more apparent in the next chapter, when we discuss concepts 
related to the limiting behavior of random processes. 
Expected Frequency of a Particular Transition 

Consider $n$ transitions of a Markov chain with a single class that is aperiodic, 
starting from a given initial state. Let $q_{jk}(n)$ be the expected number of 
such transitions that take the state from $j$ to $k$. Then, regardless of the 
initial state, we have 

$$\lim_{n \to \infty} \frac{q_{jk}(n)}{n} = \pi_j p_{jk}.$$


The frequency interpretation of $\pi_j$ and $\pi_j p_{jk}$ allows for a simple 
interpretation of the balance equations. The state is equal to $j$ if and only if there is a 
transition that brings the state to $j$. Thus, the expected frequency $\pi_j$ of visits to 
$j$ is equal to the sum of the expected frequencies $\pi_k p_{kj}$ of transitions that lead 
to $j$, and 

$$\pi_j = \sum_{k=1}^m \pi_k p_{kj};$$

see Fig. 6.13. 

Figure 6.13: Interpretation of the balance equations in terms of frequencies. 
In a very large number of transitions, there will be a fraction $\pi_k p_{kj}$ that bring 
the state from $k$ to $j$. (This also applies to transitions from $j$ to itself, which 
occur with frequency $\pi_j p_{jj}$.) The sum of the frequencies of such transitions is the 
frequency $\pi_j$ of being at state $j$. 


Birth-Death Processes 


A birth-death process is a Markov chain in which the states are linearly 
arranged and transitions can only occur to a neighboring state, or else leave the 
state unchanged. Such processes arise in many contexts, especially in queueing theory. 
Figure 6.14 shows the general structure of a birth-death process and also 
introduces some generic notation for the transition probabilities. In particular, 

$$b_i = P(X_{n+1} = i+1 \mid X_n = i), \qquad \text{(“birth” probability at state } i\text{)},$$

$$d_i = P(X_{n+1} = i-1 \mid X_n = i), \qquad \text{(“death” probability at state } i\text{)}.$$

[Graph: states arranged in a line, with self-transition probabilities such as 
$1 - b_0$ and $1 - b_1 - d_1$.] 

Figure 6.14: Transition probability graph for a birth-death process. 

For a birth-death process, the balance equations can be substantially 
simplified. Let us focus on two neighboring states, say, $i$ and $i+1$. In any trajectory 
of the Markov chain, a transition from $i$ to $i+1$ has to be followed by a transition 
from $i+1$ to $i$, before another transition from $i$ to $i+1$ can occur. Therefore, 
the frequency of transitions from $i$ to $i+1$, which is $\pi_i b_i$, must be equal to the 
frequency of transitions from $i+1$ to $i$, which is $\pi_{i+1} d_{i+1}$. This leads to the 
local balance equations† 

$$\pi_i b_i = \pi_{i+1} d_{i+1}, \qquad i = 0, 1, \ldots, m-1.$$

Using the local balance equations, we obtain 

$$\pi_i = \pi_0 \, \frac{b_0 b_1 \cdots b_{i-1}}{d_1 d_2 \cdots d_i}, \qquad i = 1, \ldots, m.$$

Together with the normalization equation $\sum_i \pi_i = 1$, the steady-state 
probabilities $\pi_i$ are easily computed. 
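The product formula translates directly into a short computation. The following sketch is illustrative: the function name `birth_death_steady_state` and the argument layout (births $b_0, \ldots, b_{m-1}$ and deaths $d_1, \ldots, d_m$ as two lists of equal length) are assumptions, and all death probabilities are assumed positive. The sample values $b = 1/5$, $d = 2/5$, $m = 3$ are hypothetical.

```python
from fractions import Fraction


def birth_death_steady_state(births, deaths):
    """Steady-state probabilities pi_0..pi_m of a birth-death chain.

    births = [b_0, ..., b_{m-1}], deaths = [d_1, ..., d_m], all deaths > 0.
    Uses pi_i = pi_0 (b_0 ... b_{i-1}) / (d_1 ... d_i), then normalizes.
    """
    weights = [Fraction(1)]                 # pi_i / pi_0, starting with i = 0
    for b, d in zip(births, deaths):
        weights.append(weights[-1] * b / d) # local balance: pi_{i+1} = pi_i b_i / d_{i+1}
    total = sum(weights)
    return [w / total for w in weights]


# Constant birth and death probabilities, as in a simple queueing model.
b, d, m = Fraction(1, 5), Fraction(2, 5), 3
pi = birth_death_steady_state([b] * m, [d] * m)
print(pi)
```

With constant $b$ and $d$, successive probabilities differ by the factor $b/d$, which is the pattern derived in the examples that follow.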


† A more formal derivation that does not rely on the frequency interpretation proceeds as follows. The balance equation at state 0 is $\pi_0(1 - b_0) + \pi_1 d_1 = \pi_0$, which yields the first local balance equation $\pi_0 b_0 = \pi_1 d_1$.

The balance equation at state 1 is $\pi_0 b_0 + \pi_1(1 - b_1 - d_1) + \pi_2 d_2 = \pi_1$. Using the local balance equation $\pi_0 b_0 = \pi_1 d_1$ at the previous state, this is rewritten as $\pi_1 d_1 + \pi_1(1 - b_1 - d_1) + \pi_2 d_2 = \pi_1$, which simplifies to $\pi_1 b_1 = \pi_2 d_2$. We can then continue similarly to obtain the local balance equations at all other states.


Example 6.9. (Random Walk with Reflecting Barriers) A person walks along a straight line and, at each time period, takes a step to the right with probability $b$, and a step to the left with probability $1 - b$. The person starts in one of the positions $1, 2, \ldots, m$, but if he reaches position 0 (or position $m+1$), his step is instantly reflected back to position 1 (or position $m$, respectively). Equivalently, we may assume that when the person is in positions 1 or $m$, he will stay in that position with corresponding probability $1 - b$ and $b$, respectively. We introduce a Markov chain model whose states are the positions $1, \ldots, m$. The transition probability graph of the chain is given in Fig. 6.15.


Figure 6.15: Transition probability graph for the random walk Example 6.9. Each transition to the right has probability $b$ and each transition to the left has probability $1 - b$.


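The local balance equations are

$$\pi_i b = \pi_{i+1}(1 - b), \qquad i = 1, \ldots, m-1.$$

Thus, $\pi_{i+1} = \rho \pi_i$, where

$$\rho = \frac{b}{1-b},$$

and we can express all the $\pi_i$ in terms of $\pi_1$, as

$$\pi_i = \rho^{i-1} \pi_1, \qquad i = 1, \ldots, m.$$

Using the normalization equation $1 = \pi_1 + \cdots + \pi_m$, we obtain

$$1 = \pi_1 (1 + \rho + \cdots + \rho^{m-1}),$$

which leads to

$$\pi_i = \frac{\rho^{i-1}}{1 + \rho + \cdots + \rho^{m-1}}, \qquad i = 1, \ldots, m.$$

Note that if $\rho = 1$, then $\pi_i = 1/m$ for all $i$.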


Example 6.10. (Birth-Death Markov Chains — Queueing) Packets arrive 
at a node of a communication network, where they are stored in a buffer and then 
transmitted. The storage capacity of the buffer is m: if m packets are already 
present, any newly arriving packets are discarded. We discretize time in very small 
periods, and we assume that in each period, at most one event can happen that 
can change the number of packets stored in the node (an arrival of a new packet or 
a completion of the transmission of an existing packet). In particular, we assume 
that at each period, exactly one of the following occurs: 
(a) one new packet arrives; this happens with a given probability $b > 0$;

(b) one existing packet completes transmission; this happens with a given probability $d > 0$ if there is at least one packet in the node, and with probability 0 otherwise;

(c) no new packet arrives and no existing packet completes transmission; this happens with probability $1 - b - d$ if there is at least one packet in the node, and with probability $1 - b$ otherwise.

We introduce a Markov chain with states 0,1,...,m, corresponding to the 
number of packets in the buffer. The transition probability graph is given in 
Fig. 6.16. 

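The local balance equations are

$$\pi_i b = \pi_{i+1} d, \qquad i = 0, 1, \ldots, m-1.$$

We define

$$\rho = \frac{b}{d},$$

and obtain $\pi_{i+1} = \rho \pi_i$, which leads to $\pi_i = \rho^i \pi_0$ for all $i$. By using the normalization equation $1 = \pi_0 + \pi_1 + \cdots + \pi_m$, we obtain

$$1 = \pi_0 (1 + \rho + \cdots + \rho^m),$$

and

$$\pi_0 = \begin{cases} \dfrac{1-\rho}{1-\rho^{m+1}}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{1}{m+1}, & \text{if } \rho = 1. \end{cases}$$

The steady-state probabilities are then given by

$$\pi_i = \begin{cases} \dfrac{\rho^i (1-\rho)}{1-\rho^{m+1}}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{1}{m+1}, & \text{if } \rho = 1, \end{cases} \qquad i = 0, 1, \ldots, m.$$


Figure 6.16: Transition probability graph in Example 6.10.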
It is interesting to consider what happens when the buffer size $m$ is so large that it can be considered as practically infinite. We distinguish two cases.


(a) Suppose that $b < d$, or $\rho < 1$. In this case, arrivals of new packets are less likely than departures of existing packets. This prevents the number of packets in the buffer from growing, and the steady-state probabilities $\pi_i$ decrease with $i$. We observe that as $m \to \infty$, we have $1 - \rho^{m+1} \to 1$, and

$$\pi_i \to \rho^i (1 - \rho), \qquad \text{for all } i.$$

We can view these as the steady-state probabilities in a system with an infinite buffer. [As a check, note that we have $\sum_{i=0}^{\infty} \rho^i (1 - \rho) = 1$.]


(b) Suppose that $b > d$, or $\rho > 1$. In this case, arrivals of new packets are more likely than departures of existing packets. The number of packets in the buffer tends to increase, and the steady-state probabilities $\pi_i$ increase with $i$. As we consider larger and larger buffer sizes $m$, the steady-state probability of any fixed state $i$ decreases to zero:

$$\pi_i \to 0, \qquad \text{for all } i.$$

Were we to consider a system with an infinite buffer, we would have a Markov chain with a countably infinite number of states. Although we do not have the machinery to study such chains, the preceding calculation suggests that every state will have zero steady-state probability and will be “transient.” The number of packets in queue will generally grow to infinity, and any particular state will be visited only a finite number of times.
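The two limiting cases can be observed numerically. The following is a small sketch, assuming the closed-form expression for the steady-state probabilities of Example 6.10; the function name and the values $\rho = 0.5$ and $\rho = 2$ are hypothetical choices made for illustration.

```python
def queue_steady_state(rho, m):
    """Steady-state probabilities pi_0..pi_m of the finite-buffer queue, rho = b/d."""
    if rho == 1:
        return [1.0 / (m + 1)] * (m + 1)
    pi0 = (1 - rho) / (1 - rho ** (m + 1))
    return [pi0 * rho ** i for i in range(m + 1)]

for m in (5, 20, 80):
    stable = queue_steady_state(0.5, m)      # case (a): b < d
    unstable = queue_steady_state(2.0, m)    # case (b): b > d
    # Print the probabilities of states 0, 1, 2 for each buffer size.
    print(m, stable[:3], unstable[:3])
```

As $m$ grows, the first list settles near $\rho^i(1-\rho)$, while the entries of the second list shrink toward zero.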


6.4 ABSORPTION PROBABILITIES AND EXPECTED TIME 
TO ABSORPTION 


In this section, we study the short-term behavior of Markov chains. We first 
consider the case where the Markov chain starts at a transient state. We are 
interested in the first recurrent state to be entered, as well as in the time until 
this happens. 

When focusing on such questions, the subsequent behavior of the Markov chain (after a recurrent state is encountered) is immaterial. We can therefore assume, without loss of generality, that every recurrent state $k$ is absorbing, i.e.,

$$p_{kk} = 1, \qquad p_{kj} = 0 \ \text{ for all } j \neq k.$$

If there is a unique absorbing state $k$, its steady-state probability is 1 (because all other states are transient and have zero steady-state probability), and will be reached with probability 1, starting from any initial state. If there are multiple absorbing states, the probability that one of them will be eventually reached is still 1, but the identity of the absorbing state to be entered is random and the associated probabilities may depend on the starting state. In the sequel, we fix a particular absorbing state, denoted by $s$, and consider the absorption probability $a_i$ that $s$ is eventually reached, starting from $i$:

$$a_i = P(X_n \text{ eventually becomes equal to the absorbing state } s \mid X_0 = i).$$

Absorption probabilities can be obtained by solving a system of linear equations, as indicated below.


Absorption Probability Equations 


Consider a Markov chain in which each state is either transient or absorbing. We fix a particular absorbing state $s$. Then, the probabilities $a_i$ of eventually reaching state $s$, starting from $i$, are the unique solution of the equations

$$a_s = 1,$$

$$a_i = 0, \qquad \text{for all absorbing } i \neq s,$$

$$a_i = \sum_{j=1}^{m} p_{ij} a_j, \qquad \text{for all transient } i.$$


The equations $a_s = 1$, and $a_i = 0$, for all absorbing $i \neq s$, are evident from the definitions. To verify the remaining equations, we argue as follows. Let us consider a transient state $i$ and let $A$ be the event that state $s$ is eventually reached. We have

$$\begin{aligned}
a_i &= P(A \mid X_0 = i) \\
&= \sum_{j=1}^{m} P(A \mid X_0 = i, X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{(total probability thm.)} \\
&= \sum_{j=1}^{m} P(A \mid X_1 = j)\, p_{ij} && \text{(Markov property)} \\
&= \sum_{j=1}^{m} a_j p_{ij}.
\end{aligned}$$


The uniqueness property of the solution of the absorption probability equations 
requires a separate argument, which is given in the theoretical problems section. 

The next example illustrates how we can use the preceding method to 
calculate the probability of entering a given recurrent class (rather than a given 
absorbing state). 


Example 6.11. Consider the Markov chain shown in Fig. 6.17(a). We would 
like to calculate the probability that the state eventually enters the recurrent class 
{4,5} starting from one of the transient states. For the purposes of this problem,
the possible transitions within the recurrent class {4,5} are immaterial. We can 
therefore lump the states in this recurrent class and treat them as a single absorbing 
state (call it state 6); see Fig. 6.17(b). It then suffices to compute the probability 
of eventually entering state 6 in this new chain. 


Figure 6.17: (a) Transition probability graph in Example 6.11. (b) A new 
graph in which states 4 and 5 have been lumped into the absorbing state 
s=6. 


The absorption probabilities $a_i$ of eventually reaching state $s = 6$ starting from state $i$, satisfy the following equations:

$$a_2 = 0.2 a_1 + 0.3 a_2 + 0.4 a_3 + 0.1 a_6,$$

$$a_3 = 0.2 a_2 + 0.8 a_6.$$

Using the facts $a_1 = 0$ and $a_6 = 1$, we obtain

$$a_2 = 0.3 a_2 + 0.4 a_3 + 0.1,$$

$$a_3 = 0.2 a_2 + 0.8.$$

This is a system of two equations in the two unknowns $a_2$ and $a_3$, which can be readily solved to yield $a_2 = 21/31$ and $a_3 = 29/31$.
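The same system can be solved mechanically. The sketch below is one possible way to do it, assuming the transition probabilities that appear in the equations of Example 6.11; the helper names `solve` and `absorption_probabilities` are hypothetical, and a state is treated as absorbing exactly when its self-transition probability is 1.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def absorption_probabilities(P, target):
    """Probabilities a_i of reaching `target` from each transient state i.

    P maps each state to a dict {next_state: probability}.
    """
    transient = [i for i in P if P[i].get(i, 0) != 1]
    # For transient i: a_i - sum_{transient j} p_ij a_j = p_{i,target}
    A = [[(1 if i == j else 0) - P[i].get(j, F(0)) for j in transient]
         for i in transient]
    b = [P[i].get(target, F(0)) for i in transient]
    return dict(zip(transient, solve(A, b)))

# Chain of Fig. 6.17(b), read off from the equations of Example 6.11.
P = {
    1: {1: F(1)},
    2: {1: F(2, 10), 2: F(3, 10), 3: F(4, 10), 6: F(1, 10)},
    3: {2: F(2, 10), 6: F(8, 10)},
    6: {6: F(1)},
}
a = absorption_probabilities(P, target=6)
print(a)
```

Moving the terms with absorbing states to the right-hand side is what uses the facts $a_s = 1$ and $a_i = 0$ for the other absorbing states.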


Example 6.12. (Gambler’s Ruin) A gambler wins $1 at each round, with probability $p$, and loses $1, with probability $1 - p$. Different rounds are assumed independent. The gambler plays continuously until he either accumulates a target amount of $m$ dollars, or loses all his money. What is the probability of eventually accumulating the target amount (winning) or of losing his fortune?

We introduce the Markov chain shown in Fig. 6.18 whose state $i$ represents the gambler’s wealth at the beginning of a round. The states $i = 0$ and $i = m$ correspond to losing and winning, respectively.

All states are transient, except for the winning and losing states which are absorbing. Thus, the problem amounts to finding the probabilities of absorption at each one of these two absorbing states. Of course, these absorption probabilities depend on the initial state $i$.


Figure 6.18: Transition probability graph for the gambler’s ruin problem (Example 6.12). Here $m = 4$. State 0 is labeled “Lose” and state 4 is labeled “Win.”


Let us set $s = 0$ in which case the absorption probability $a_i$ is the probability of losing, starting from state $i$. These probabilities satisfy

$$a_0 = 1,$$
$$a_i = (1 - p) a_{i-1} + p a_{i+1}, \qquad i = 1, \ldots, m-1,$$
$$a_m = 0.$$

These equations can be solved in a variety of ways. It turns out there is an elegant method that leads to a nice closed form solution.
Let us write the equations for the $a_i$ as

$$(1 - p)(a_{i-1} - a_i) = p(a_i - a_{i+1}), \qquad i = 1, \ldots, m-1.$$

Then, by denoting

$$\delta_i = a_i - a_{i+1}, \qquad i = 0, \ldots, m-1,$$

and

$$\rho = \frac{1-p}{p},$$

the equations are written as

$$\delta_i = \rho \delta_{i-1}, \qquad i = 1, \ldots, m-1,$$

from which we obtain

$$\delta_i = \rho^i \delta_0, \qquad i = 1, \ldots, m-1.$$
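This, together with the equation $\delta_0 + \delta_1 + \cdots + \delta_{m-1} = a_0 - a_m = 1$, implies that

$$(1 + \rho + \cdots + \rho^{m-1}) \delta_0 = 1.$$

Thus, we have

$$\delta_0 = \begin{cases} \dfrac{1-\rho}{1-\rho^m}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{1}{m}, & \text{if } \rho = 1, \end{cases}$$

and, more generally,

$$\delta_i = \begin{cases} \dfrac{\rho^i (1-\rho)}{1-\rho^m}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{1}{m}, & \text{if } \rho = 1. \end{cases}$$

From these relations, we can calculate the probabilities $a_i$. If $\rho \neq 1$, we have

$$\begin{aligned}
a_i &= a_0 - \delta_{i-1} - \cdots - \delta_0 \\
&= 1 - (\rho^{i-1} + \cdots + \rho + 1)\delta_0 \\
&= 1 - \frac{1-\rho^i}{1-\rho} \cdot \frac{1-\rho}{1-\rho^m} \\
&= 1 - \frac{1-\rho^i}{1-\rho^m} \\
&= \frac{\rho^i - \rho^m}{1-\rho^m}, \qquad i = 1, \ldots, m-1.
\end{aligned}$$

If $\rho = 1$, we similarly obtain

$$a_i = \frac{m-i}{m}.$$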

The probability of winning, starting from a fortune $i$, is the complement $1 - a_i$, and is equal to

$$1 - a_i = \begin{cases} \dfrac{1-\rho^i}{1-\rho^m}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{i}{m}, & \text{if } \rho = 1. \end{cases}$$

The solution reveals that if $\rho > 1$, which corresponds to $p < 1/2$ and unfavorable odds for the gambler, the probability of losing approaches 1 as $m \to \infty$ regardless of the size of the initial fortune. This suggests that if you aim for a large profit under unfavorable odds, financial ruin is almost certain.
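The closed form can be checked against the original absorption equations. The following is a minimal sketch under assumed values: the win probability $p = 9/19$ and the target $m = 10$ are hypothetical, and the function name is chosen only for this illustration.

```python
from fractions import Fraction

def ruin_probability(i, m, p):
    """Probability a_i of losing everything, starting from wealth i with target m."""
    rho = (1 - p) / p
    if rho == 1:
        return Fraction(m - i, m)
    return (rho ** i - rho ** m) / (1 - rho ** m)

p = Fraction(9, 19)    # hypothetical, slightly unfavorable odds (rho = 10/9)
m = 10
a = [ruin_probability(i, m, p) for i in range(m + 1)]

# The closed form should satisfy a_i = (1 - p) a_{i-1} + p a_{i+1}.
recursion_holds = all(
    a[i] == (1 - p) * a[i - 1] + p * a[i + 1] for i in range(1, m)
)
print(recursion_holds, float(a[5]))

# Aiming for a much larger target makes ruin almost certain.
print(float(ruin_probability(5, 200, p)))
```

Because the arithmetic is exact, the recursion check is an equality test rather than a comparison up to rounding error.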
Expected Time to Absorption


We now turn our attention to the expected number of steps until a recurrent state is entered (an event that we refer to as “absorption”), starting from a particular transient state. For any state $i$, we denote

$$\begin{aligned}
\mu_i &= E[\text{number of transitions until absorption, starting from } i] \\
&= E[\min\{n \geq 0 \mid X_n \text{ is recurrent}\} \mid X_0 = i].
\end{aligned}$$

If $i$ is recurrent, this definition sets $\mu_i$ to zero.

We can derive equations for the $\mu_i$ by using the total expectation theorem. We argue that the expected time to absorption starting from a transient state $i$ is equal to 1 plus the expected time to absorption starting from the next state, which is $j$ with probability $p_{ij}$. This leads to a system of linear equations which is stated below. It turns out that these equations have a unique solution, but the argument for establishing this fact is beyond our scope.


Equations for the Expected Time to Absorption 


The expected times $\mu_i$ to absorption, starting from state $i$, are the unique solution of the equations

$$\mu_i = 0, \qquad \text{for all recurrent states } i,$$

$$\mu_i = 1 + \sum_{j=1}^{m} p_{ij} \mu_j, \qquad \text{for all transient states } i.$$


Example 6.13. (Spiders and Fly) Consider the spiders-and-fly model of Ex- 
ample 6.2. This corresponds to the Markov chain shown in Fig. 6.19. The states 
correspond to possible fly positions, and the absorbing states 1 and m correspond 
to capture by a spider. 
Let us calculate the expected number of steps until the fly is captured. We have

$$\mu_1 = \mu_m = 0,$$

and

$$\mu_i = 1 + 0.3 \cdot \mu_{i-1} + 0.4 \cdot \mu_i + 0.3 \cdot \mu_{i+1}, \qquad \text{for } i = 2, \ldots, m-1.$$

We can solve these equations in a variety of ways, such as for example by successive substitution. As an illustration, let $m = 4$, in which case, the equations reduce to

$$\mu_2 = 1 + 0.4 \cdot \mu_2 + 0.3 \cdot \mu_3, \qquad \mu_3 = 1 + 0.3 \cdot \mu_2 + 0.4 \cdot \mu_3.$$

The first equation yields $\mu_2 = (1/0.6) + (1/2)\mu_3$, which we can substitute in the second equation and solve for $\mu_3$. We obtain $\mu_3 = 10/3$ and by substitution again, $\mu_2 = 10/3$.


Figure 6.19: Transition probability graph in Example 6.13.
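For larger $m$, the equations of Example 6.13 can also be solved numerically. The sketch below uses repeated substitution of the current estimates into the right-hand sides, which is a different technique from the hand calculation in the example; the function name and the number of sweeps are hypothetical choices.

```python
def expected_capture_times(m, p_move=0.3, p_stay=0.4, sweeps=2000):
    """Approximate mu_1..mu_m for the spiders-and-fly chain with states 1..m.

    States 1 and m are absorbing (capture), so mu_1 = mu_m = 0.
    """
    mu = [0.0] * (m + 2)                 # index i holds mu_i; index 0 is unused
    for _ in range(sweeps):
        new = mu[:]
        for i in range(2, m):            # transient states 2..m-1
            new[i] = 1 + p_move * mu[i - 1] + p_stay * mu[i] + p_move * mu[i + 1]
        mu = new
    return mu[1:m + 1]

mu = expected_capture_times(4)
print(mu)    # mu_2 and mu_3 should be close to 10/3
```

Starting from all zeros, each sweep adds the contribution of one more transition, so the iterates increase toward the solution of the linear system.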


Mean First Passage Times 


The same idea used to calculate the expected time to absorption can be used to 
calculate the expected time to reach a particular recurrent state, starting from 
any other state. Throughout this subsection, we consider a Markov chain with 
a single recurrent class. We focus on a special recurrent state $s$, and we denote by $t_i$ the mean first passage time from state $i$ to state $s$, defined by

$$\begin{aligned}
t_i &= E[\text{number of transitions to reach } s \text{ for the first time, starting from } i] \\
&= E[\min\{n \geq 0 \mid X_n = s\} \mid X_0 = i].
\end{aligned}$$


The transitions out of state s are irrelevant to the calculation of the mean 
first passage times. We may thus consider a new Markov chain which is identical 
to the original, except that the special state s is converted into an absorbing 
state (by setting $p_{ss} = 1$, and $p_{sj} = 0$ for all $j \neq s$). We then compute $t_i$ as the expected number of steps to absorption starting from $i$, using the formulas given earlier in this section. We have

$$t_i = 1 + \sum_{j=1}^{m} p_{ij} t_j, \qquad \text{for all } i \neq s,$$

$$t_s = 0.$$

This system of linear equations can be solved for the unknowns $t_i$, and is known to have a unique solution.
The above equations give the expected time to reach the special state $s$ starting from any other state. We may also want to calculate the mean recurrence time of the special state $s$, which is defined as

$$\begin{aligned}
t_s^* &= E[\text{number of transitions up to the first return to } s, \text{ starting from } s] \\
&= E[\min\{n \geq 1 \mid X_n = s\} \mid X_0 = s].
\end{aligned}$$
We can obtain $t_s^*$, once we have the first passage times $t_i$, by using the equation

$$t_s^* = 1 + \sum_{j=1}^{m} p_{sj} t_j.$$

To justify this equation, we argue that the time to return to $s$, starting from $s$, is equal to 1 plus the expected time to reach $s$ from the next state, which is $j$ with probability $p_{sj}$. We then apply the total expectation theorem.


Example 6.14. Consider the “up-to-date”—“behind” model of Example 6.1. 
States 1 and 2 correspond to being up-to-date and being behind, respectively, and 
the transition probabilities are 


$$p_{11} = 0.8, \qquad p_{12} = 0.2,$$
$$p_{21} = 0.6, \qquad p_{22} = 0.4.$$

Let us focus on state $s = 1$ and calculate the mean first passage time to state 1, starting from state 2. We have $t_1 = 0$ and

$$t_2 = 1 + p_{21} t_1 + p_{22} t_2 = 1 + 0.4 \cdot t_2,$$

from which

$$t_2 = \frac{1}{0.6} = \frac{5}{3}.$$

The mean recurrence time to state 1 is given by

$$t_1^* = 1 + p_{11} t_1 + p_{12} t_2 = 1 + 0 + 0.2 \cdot \frac{5}{3} = \frac{4}{3}.$$
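The calculation of Example 6.14 can be reproduced for any chain with a single recurrent class. The following sketch is illustrative only: the helper names are hypothetical, the chain is given as a dictionary of transition probabilities, and exact fractions are used so that the values 5/3 and 4/3 appear exactly.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]

def mean_passage_times(P, s):
    """Return (t, t_star): first passage times to s and the recurrence time of s."""
    states = list(P)
    others = [i for i in states if i != s]
    # For i != s: t_i - sum_{j != s} p_ij t_j = 1   (since t_s = 0)
    A = [[(1 if i == j else 0) - P[i].get(j, F(0)) for j in others]
         for i in others]
    t = dict(zip(others, solve(A, [F(1)] * len(others))))
    t[s] = F(0)
    t_star = 1 + sum(P[s].get(j, F(0)) * t[j] for j in states)
    return t, t_star

# "Up-to-date"/"behind" chain of Example 6.14.
P = {1: {1: F(8, 10), 2: F(2, 10)}, 2: {1: F(6, 10), 2: F(4, 10)}}
t, t_star = mean_passage_times(P, s=1)
print(t[2], t_star)
```

Note that the transitions out of $s$ enter only the final formula for $t_s^*$, in agreement with the remark that they are irrelevant to the first passage times.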


Summary of Facts About Mean First Passage Times 


Consider a Markov chain with a single recurrent class, and let s be a par- 
ticular recurrent state. 


• The mean first passage times $t_i$ to reach state $s$ starting from $i$, are the unique solution to the system of equations

$$t_s = 0, \qquad t_i = 1 + \sum_{j=1}^{m} p_{ij} t_j, \quad \text{for all } i \neq s.$$

• The mean recurrence time $t_s^*$ of state $s$ is given by

$$t_s^* = 1 + \sum_{j=1}^{m} p_{sj} t_j.$$


6.5 MORE GENERAL MARKOV CHAINS


The discrete-time, finite-state Markov chain model that we have considered so 
far is the simplest example of an important Markov process. In this section, 
we briefly discuss some generalizations that involve either a countably infinite 
number of states or a continuous time, or both. A detailed theoretical develop- 
ment for these types of models is beyond our scope, so we just discuss their main 
underlying ideas, relying primarily on examples. 


Chains with Countably Infinite Number of States 


Consider a Markov process $\{X_1, X_2, \ldots\}$ whose state can take any positive integer value. The transition probabilities

$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad i, j = 1, 2, \ldots$$

are given, and can be used to represent the process with a transition probability graph that has an infinite number of nodes, corresponding to the integers $1, 2, \ldots$

It is straightforward to verify, using the total probability theorem in a similar way as in Section 6.1, that the $n$-step transition probabilities

$$r_{ij}(n) = P(X_n = j \mid X_0 = i), \qquad i, j = 1, 2, \ldots$$

satisfy the Chapman-Kolmogorov equations

$$r_{ij}(n+1) = \sum_{k=1}^{\infty} r_{ik}(n)\, p_{kj}, \qquad i, j = 1, 2, \ldots$$

Furthermore, if the $r_{ij}(n)$ converge to steady-state values $\pi_j$ as $n \to \infty$, then by taking the limit in the preceding equation, we obtain

$$\pi_j = \sum_{k=1}^{\infty} \pi_k p_{kj}, \qquad j = 1, 2, \ldots$$

These are the balance equations for a Markov chain with states $1, 2, \ldots$

It is important to have conditions guaranteeing that the $r_{ij}(n)$ indeed converge to steady-state values $\pi_j$ as $n \to \infty$. As we can expect from the finite-state case, such conditions should include some analog of the requirement that there is a single recurrent class that is aperiodic. Indeed, we require that:


(a) each state is accessible from every other state;


(b) the set of all states is aperiodic in the sense that there is no $d > 1$ such that the states can be grouped in $d > 1$ disjoint subsets $S_1, \ldots, S_d$ so that all transitions from one subset lead to the next subset.

These conditions are sufficient to guarantee the convergence to a steady-state

$$\lim_{n \to \infty} r_{ij}(n) = \pi_j, \qquad i, j = 1, 2, \ldots$$

but something peculiar may also happen here, which is not possible if the number of states is finite: the limits $\pi_j$ may not add to 1, so that $(\pi_1, \pi_2, \ldots)$ may not be a probability distribution. In fact, we can prove the following theorem (the proof is beyond our scope).


Steady-State Convergence Theorem 


Under the above accessibility and aperiodicity assumptions (a) and (b), 
there are only two possibilities: 


(1) The $r_{ij}(n)$ converge to a steady state probability distribution $(\pi_1, \pi_2, \ldots)$. In this case the $\pi_j$ uniquely solve the balance equations together with the normalization equation $\pi_1 + \pi_2 + \cdots = 1$. Furthermore, the $\pi_j$ have an expected frequency interpretation:

$$\pi_j = \lim_{n \to \infty} \frac{v_{ij}(n)}{n},$$

where $v_{ij}(n)$ is the expected number of visits to state $j$ within the first $n$ transitions, starting from state $i$.


(2) All the $r_{ij}(n)$ converge to 0 as $n \to \infty$ and the balance equations have no solution, other than $\pi_j = 0$ for all $j$.


For an example of possibility (2) above, consider the packet queueing sys- 
tem of Example 6.10 for the case where the probability b of a packet arrival in 
each period is larger than the probability d of a departure. Then, as we saw 
in that example, as the buffer size m increases, the size of the queue will tend 
to increase without bound, and the steady-state probability of any one state 
will tend to 0 as m — oo. In effect, with infinite buffer space, the system is 
“unstable” when b > d, and all states are “transient.” 

An important consequence of the steady-state convergence theorem is that if we can find a probability distribution $(\pi_1, \pi_2, \ldots)$ that solves the balance equations, then we can be sure that it is the steady-state distribution. This line of argument is very useful in queueing systems as illustrated in the following two examples.


Example 6.15. (Queueing with Infinite Buffer Space) Consider, as in Ex- 
ample 6.10, a communication node, where packets arrive and are stored in a buffer 
before getting transmitted. We assume that the node can store an infinite number 
of packets. We discretize time in very small periods, and we assume that in each period, one of the following occurs:


(a) one new packet arrives; this happens with a given probability $b > 0$;


(b) one existing packet completes transmission; this happens with a given probability $d > 0$ if there is at least one packet in the node, and with probability 0 otherwise;


(c) no new packet arrives and no existing packet completes transmission; this happens with probability $1 - b - d$ if there is at least one packet in the node, and with probability $1 - b$ otherwise.


Figure 6.20: Transition probability graph in Example 6.15. 


We introduce a Markov chain whose states are $0, 1, \ldots$, corresponding to the number of packets in the buffer. The transition probability graph is given in Fig. 6.20. As in the case of a finite number of states, the local balance equations are

$$\pi_i b = \pi_{i+1} d, \qquad i = 0, 1, \ldots,$$

and we obtain $\pi_{i+1} = \rho \pi_i$, where $\rho = b/d$. Thus, we have $\pi_i = \rho^i \pi_0$ for all $i$. If $\rho < 1$, the normalization equation $1 = \sum_{i=0}^{\infty} \pi_i$ yields

$$1 = \pi_0 \sum_{i=0}^{\infty} \rho^i = \frac{\pi_0}{1 - \rho},$$

in which case $\pi_0 = 1 - \rho$, and the steady-state probabilities are

$$\pi_i = \rho^i (1 - \rho), \qquad i = 0, 1, \ldots$$

If $\rho \geq 1$, which corresponds to the case where the arrival probability $b$ is no less than the departure probability $d$, the normalization equation $1 = \pi_0 (1 + \rho + \rho^2 + \cdots)$ implies that $\pi_0 = 0$, and also $\pi_i = \rho^i \pi_0 = 0$ for all $i$.
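The convergence of the $n$-step probabilities to the geometric distribution can be observed numerically. The sketch below is only an approximation of the infinite-buffer chain: the state space is truncated at a hypothetical level of 60 packets, and the values $b = 0.2$, $d = 0.4$ are assumed for illustration.

```python
def step(dist, b, d):
    """Apply one transition of the queue chain to a distribution over states 0..N.

    The buffer is truncated at N = len(dist) - 1, an approximation of the
    infinite buffer that is accurate when rho = b/d < 1 and N is large.
    """
    n = len(dist)
    new = [0.0] * n
    for i, prob in enumerate(dist):
        up = b if i < n - 1 else 0.0     # no arrival is accepted at the truncation level
        down = d if i > 0 else 0.0       # no departure from an empty buffer
        new[i] += prob * (1.0 - up - down)
        if i < n - 1:
            new[i + 1] += prob * up
        if i > 0:
            new[i - 1] += prob * down
    return new

b, d = 0.2, 0.4                          # hypothetical probabilities, rho = 0.5
rho = b / d
dist = [1.0] + [0.0] * 60                # start with an empty buffer
for _ in range(3000):
    dist = step(dist, b, d)

for j in range(4):
    # Compare the propagated probability with rho^j (1 - rho).
    print(j, dist[j], (1 - rho) * rho ** j)
```

Each row of the printout pairs a propagated probability with the corresponding value of $\rho^j(1-\rho)$; with $\rho < 1$ the truncation has a negligible effect on the low-numbered states.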


Example 6.16. (The M/G/1 Queue) Packets arrive at a node of a communi- 
cation network, where they are stored at an infinite capacity buffer and are then 
transmitted one at a time. The arrival process of the packets is Poisson with rate $\lambda$, and the transmission time of a packet has a given CDF. Furthermore, the transmission times of different packets are independent and are also independent from all the interarrival times of the arrival process.

This queueing system is known as the M/G/1 system. With changes in terminology, it applies to many different practical contexts where “service” is provided to “arriving customers,” such as in communication, transportation, and manufacturing, among others. The name M/G/1 is an example of shorthand terminology from queueing theory, whereby the first letter (M in this case) characterizes the customer arrival process (Poisson in this case), the second letter (G in this case) characterizes the distribution of the service time of the queue (general in this case), and the number (1 in this case) characterizes the number of customers that can be simultaneously served.

To model this system as a discrete-time Markov chain, we focus on the time instants when a packet completes transmission and departs from the system. We denote by $X_n$ the number of packets in the system just after the $n$th customer’s departure. We have

$$X_{n+1} = \begin{cases} X_n - 1 + S_n, & \text{if } X_n > 0, \\ S_n, & \text{if } X_n = 0, \end{cases}$$

where $S_n$ is the number of packet arrivals during the $(n+1)$st packet’s transmission. In view of the Poisson assumption, the random variables $S_1, S_2, \ldots$ are independent and their PMF can be calculated using the given CDF of the transmission time, and the fact that in an interval of length $r$, the number of packet arrivals is Poisson-distributed with parameter $\lambda r$. In particular, let us denote

$$\alpha_k = P(S_n = k), \qquad k = 0, 1, \ldots,$$

and let us assume that the transmission time $R$ of a packet is a discrete random variable taking the values $r_1, \ldots, r_m$ with probabilities $p_1, \ldots, p_m$. Then, we have for all $k \geq 0$,

$$\alpha_k = \sum_{j=1}^{m} p_j \frac{e^{-\lambda r_j} (\lambda r_j)^k}{k!},$$

while if $R$ is a continuous random variable with PDF $f_R(r)$, we have for all $k \geq 0$,

$$\alpha_k = \int_{r=0}^{\infty} P(S_n = k \mid R = r) f_R(r)\, dr = \int_{r=0}^{\infty} \frac{e^{-\lambda r} (\lambda r)^k}{k!} f_R(r)\, dr.$$

The probabilities $\alpha_k$ define in turn the transition probabilities of the Markov chain $\{X_n\}$, as follows (see Fig. 6.21):

$$p_{ij} = \begin{cases} \alpha_j, & \text{if } i = 0 \text{ and } j \geq 0, \\ \alpha_{j-i+1}, & \text{if } i > 0 \text{ and } j \geq i-1, \\ 0, & \text{otherwise.} \end{cases}$$

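Clearly, this Markov chain satisfies the accessibility and aperiodicity conditions that guarantee steady-state convergence. There are two possibilities: either $(\pi_0, \pi_1, \ldots)$ form a probability distribution, or else $\pi_j = 0$ for all $j$. We will clarify the conditions under which each of these cases holds, and we will also calculate the transform $M(s)$ (when it exists) of the steady-state distribution $(\pi_0, \pi_1, \ldots)$:

$$M(s) = \sum_{j=0}^{\infty} \pi_j e^{sj}.$$


Figure 6.21: Transition probability graph for the number of packets left behind by a packet completing transmission in the M/G/1 queue (Example 6.16).


For this purpose, we will use the transform of the PMF $\{\alpha_k\}$:

$$A(s) = \sum_{j=0}^{\infty} \alpha_j e^{sj}.$$

Indeed, let us multiply the balance equations

$$\pi_j = \pi_0 \alpha_j + \sum_{i=1}^{j+1} \pi_i \alpha_{j-i+1},$$

with $e^{sj}$ and add over all $j$. We obtain

$$\begin{aligned}
M(s) &= \sum_{j=0}^{\infty} \pi_0 \alpha_j e^{sj} + \sum_{j=0}^{\infty} \sum_{i=1}^{j+1} \pi_i \alpha_{j-i+1} e^{sj} \\
&= \pi_0 A(s) + \sum_{i=1}^{\infty} \pi_i e^{s(i-1)} \sum_{j=i-1}^{\infty} \alpha_{j-i+1} e^{s(j-i+1)} \\
&= \pi_0 A(s) + \frac{A(s)}{e^s} \sum_{i=1}^{\infty} \pi_i e^{si} \\
&= \pi_0 A(s) + \frac{A(s)\big(M(s) - \pi_0\big)}{e^s},
\end{aligned}$$

or

$$M(s) = \frac{(e^s - 1)\pi_0 A(s)}{e^s - A(s)}.$$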
To calculate $\pi_0$, we take the limit as $s \to 0$ in the above formula, and we use the fact $M(0) = 1$ when $\{\pi_j\}$ is a probability distribution. We obtain, using the fact $A(0) = 1$ and L’Hospital’s rule,

$$1 = \lim_{s \to 0} \frac{(e^s - 1)\pi_0 A(s)}{e^s - A(s)} = \frac{\pi_0}{1 - \big(dA(s)/ds\big)\big|_{s=0}} = \frac{\pi_0}{1 - E[N]},$$

where $E[N] = \sum_{j=0}^{\infty} j \alpha_j$ is the expected value of the number $N$ of packet arrivals within a packet’s transmission time. Using the iterated expectations formula, we have

$$E[N] = \lambda E[R],$$

where $E[R]$ is the expected value of the transmission time. Thus,

$$\pi_0 = 1 - \lambda E[R],$$

and the transform of the steady-state distribution $\{\pi_j\}$ is

$$M(s) = \frac{(e^s - 1)(1 - \lambda E[R]) A(s)}{e^s - A(s)}.$$

For the above calculation to be correct, we must have $E[N] < 1$, i.e., packets should arrive at a rate that is smaller than the transmission rate of the node. If this is not true, the system is not “stable” and there is no steady-state distribution, i.e., the only solution of the balance equations is $\pi_j = 0$ for all $j$.

Let us finally note that we have introduced the $\pi_j$ as the steady-state probability that $j$ packets are left behind in the system by a packet upon completing transmission. However, it turns out that $\pi_j$ is also equal to the steady-state probability of $j$ packets found in the system by an observer that looks at the system at
a “typical” time far into the future. This is discussed in the theoretical problems, 
but to get an idea of the underlying reason, note that for each time the number of 
packets in the system increases from n to n + 1 due to an arrival, there will be a 
corresponding future decrease from n+ 1 to n due to a departure. Therefore, in 
the long run, the frequency of transitions from n to n+ 1 is equal to the frequency 
of transitions from n+ 1 to n. Therefore, in steady-state, the system appears 
statistically identical to an arriving and to a departing packet. Now, because the 
packet interarrival times are independent and exponentially distributed, the times 
of packet arrivals are “typical” and do not depend on the number of packets in 
the system. With some care this argument can be made precise, and shows that 
at the times when packets complete their transmissions and depart, the system is 
“typically loaded.” 
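For a discrete transmission time, the quantities $\alpha_k$, $E[N]$, and $\pi_0$ of Example 6.16 are easy to evaluate. The sketch below assumes a hypothetical arrival rate $\lambda = 0.5$ and a transmission time equal to 1 or 2 with equal probability; the function name is also an assumption, and the infinite sum for $E[N]$ is truncated at $k = 60$.

```python
import math

def arrival_pmf(lam, values, probs, k_max):
    """alpha_k = sum_j p_j e^{-lam r_j} (lam r_j)^k / k!  for k = 0..k_max."""
    return [
        sum(p * math.exp(-lam * r) * (lam * r) ** k / math.factorial(k)
            for r, p in zip(values, probs))
        for k in range(k_max + 1)
    ]

lam = 0.5                                    # hypothetical arrival rate
values, probs = [1.0, 2.0], [0.5, 0.5]       # hypothetical PMF of R

alpha = arrival_pmf(lam, values, probs, k_max=60)
mean_arrivals = sum(k * a for k, a in enumerate(alpha))    # E[N], truncated sum
mean_service = sum(r * p for r, p in zip(values, probs))   # E[R]
pi0 = 1 - lam * mean_service                               # valid since E[N] < 1

print(mean_arrivals, lam * mean_service, pi0)
```

The first two printed numbers agree, which illustrates the relation $E[N] = \lambda E[R]$, and the third is the steady-state probability that a departing packet leaves the system empty.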


Continuous-Time Markov Chains 


We have implicitly assumed so far that the transitions between states take unit 
time. When the time between transitions takes values from a continuous range, 
some new questions arise. For example, what is the proportion of time that the 
system spends at a particular state (as opposed to the frequency of visits into
the state)? 

Let the states be denoted by $1, 2, \ldots$, and let us assume that state transitions occur at discrete times, but the time from one transition to the next is random. In particular, we assume that:


(a) If the current state is $i$, the next state will be $j$ with a given probability $p_{ij}$.

(b) The time interval $\Delta_i$ between the transition to state $i$ and the transition to the next state is exponentially distributed with a given parameter $\nu_i$:

$$P(\Delta_i \leq \delta \mid \text{current state is } i) = 1 - e^{-\nu_i \delta}.$$

Furthermore, $\Delta_i$ is independent of earlier transition times and states.


The parameter $\nu_i$ is referred to as the transition rate associated with state $i$. Since the expected transition time is

$$E[\Delta_i] = \int_0^{\infty} \delta \nu_i e^{-\nu_i \delta}\, d\delta = \frac{1}{\nu_i},$$

we can interpret $\nu_i$ as the average number of transitions per unit time. We may also view

$$q_{ij} = p_{ij} \nu_i$$

as the rate at which the process makes a transition to $j$ when at state $i$. Consequently, we call $q_{ij}$ the transition rate from $i$ to $j$. Note that given the transition rates $q_{ij}$, one can obtain the node transition rates using the formula $\nu_i = \sum_{j=1}^{\infty} q_{ij}$.

The state of the chain at time $t \geq 0$ is denoted by $X(t)$, and stays constant between transitions. Let us recall the memoryless property of the exponential distribution, which in our context implies that, for any time $t$ between the $k$th and $(k+1)$st transition times $t_k$ and $t_{k+1}$, the additional time $t_{k+1} - t$ needed to effect the next transition is independent of the time $t - t_k$ that the system has been in the current state. This implies the Markov character of the process, i.e., that at any time $t$, the future of the process [the random variables $X(\bar{t})$ for $\bar{t} > t$] depends on the past of the process [the values of the random variables $X(\tau)$ for $\tau < t$] only through the present value of $X(t)$.


Example 6.17. (The M/M/1 Queue) Packets arrive at a node of a communication network according to a Poisson process with rate $\lambda$. The packets are stored at an infinite capacity buffer and are then transmitted one at a time. The transmission time of a packet is exponentially distributed with parameter $\mu$, and the transmission times of different packets are independent and are also independent from all the interarrival times of the arrival process. Thus, this queueing system is identical to the special case of the M/G/1 system, where the transmission times are exponentially distributed (this is indicated by the second M in the M/M/1 name).
We will model this system using a continuous-time process with state $X(t)$ equal to the number of packets in the system at time $t$ [if $X(t) > 0$, then $X(t) - 1$ packets are waiting in the queue and one packet is under transmission]. The state increases by one when a new packet arrives and decreases by one when an existing packet departs. To show that this process is a continuous-time Markov chain, let us identify the transition rates $\nu_i$ and $q_{ij}$ at each state $i$.

Consider first the case where at some time $t$, the system becomes empty, i.e., the state becomes equal to 0. Then the next transition will occur at the next arrival, which will happen in time that is exponentially distributed with parameter $\lambda$. Thus at state 0, we have the transition rates

$$q_{0j} = \begin{cases} \lambda, & \text{if } j = 1, \\ 0, & \text{otherwise.} \end{cases}$$


Consider next the case of a positive state $i$, and suppose that a transition occurs at some time $t$ to $X(t) = i$. If the next transition occurs at time $t + \Delta_i$, then $\Delta_i$ is the minimum of two exponentially distributed random variables: the time to the next arrival, call it $Y$, which has parameter $\lambda$, and the time to the next departure, call it $Z$, which has parameter $\mu$. (We are again using here the memoryless property of the exponential distribution.) Thus according to Example 5.15, which deals with “competing exponentials,” the time $\Delta_i$ is exponentially distributed with parameter $\nu_i = \lambda + \mu$. Furthermore, the probability that the next transition corresponds to an arrival is

$$\begin{aligned}
P(Y \leq Z) &= \iint_{y \leq z} \lambda e^{-\lambda y} \cdot \mu e^{-\mu z}\, dy\, dz \\
&= \lambda \mu \int_0^{\infty} e^{-\lambda y} \left( \int_y^{\infty} e^{-\mu z}\, dz \right) dy \\
&= \lambda \mu \int_0^{\infty} e^{-\lambda y} \left( \frac{e^{-\mu y}}{\mu} \right) dy \\
&= \lambda \int_0^{\infty} e^{-(\lambda + \mu) y}\, dy \\
&= \frac{\lambda}{\lambda + \mu}.
\end{aligned}$$

We thus have for $i > 0$, $q_{i,i+1} = \nu_i P(Y \leq Z) = (\lambda + \mu)\big(\lambda/(\lambda + \mu)\big) = \lambda$. Similarly, we obtain that the probability that the next transition corresponds to a departure is $\mu/(\lambda + \mu)$, and we have $q_{i,i-1} = \nu_i P(Y \geq Z) = (\lambda + \mu)\big(\mu/(\lambda + \mu)\big) = \mu$. Thus

$$q_{ij} = \begin{cases} \lambda, & \text{if } j = i+1, \\ \mu, & \text{if } j = i-1, \\ 0, & \text{otherwise.} \end{cases}$$

The positive transition rates $q_{ij}$ are recorded next to the arcs $(i, j)$ of the transition diagram, as in Fig. 6.22.
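The “competing exponentials” step of Example 6.17 can be illustrated by simulation. This is a rough Monte Carlo sketch, not part of the derivation: the rates $\lambda = 1$ and $\mu = 2$, the number of trials, and the function name are all hypothetical.

```python
import random

def simulate_competing_exponentials(lam, mu, trials, seed=0):
    """Estimate P(Y <= Z) and E[min(Y, Z)] for independent exponentials Y, Z."""
    rng = random.Random(seed)
    arrivals_first = 0
    total_wait = 0.0
    for _ in range(trials):
        y = rng.expovariate(lam)     # time to the next arrival
        z = rng.expovariate(mu)      # time to the next departure
        arrivals_first += y <= z
        total_wait += min(y, z)
    return arrivals_first / trials, total_wait / trials

lam, mu = 1.0, 2.0                   # hypothetical rates
p_arrival, mean_wait = simulate_competing_exponentials(lam, mu, trials=200_000)
print(p_arrival, lam / (lam + mu))   # estimate vs. lambda / (lambda + mu)
print(mean_wait, 1 / (lam + mu))     # estimate vs. 1 / nu_i
```

The estimates are subject to sampling error, so they should only be expected to agree with the formulas to about two decimal places at this number of trials.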


Figure 6.22: Transition graph for the M/M/1 queue (Example 6.17).


We will be interested in chains for which the discrete-time Markov chain corresponding to the transition probabilities $p_{ij}$ satisfies the accessibility and aperiodicity assumptions of the preceding section. We also require a technical condition, namely that the number of transitions in any finite length of time is finite with probability one. Almost all models of practical use satisfy this condition, although it is possible to construct examples that do not.

Under the preceding conditions, it can be shown that the limit

$$
\pi_j = \lim_{t \to \infty} P\bigl(X(t) = j \mid X(0) = i\bigr)
$$

exists and is independent of the initial state $i$. We refer to $\pi_j$ as the
steady-state probability of state $j$. It can be shown that if $T_j(t)$ is the
expected value of the time spent in state $j$ up to time $t$, then, regardless
of the initial state, we have

$$
\pi_j = \lim_{t \to \infty} \frac{T_j(t)}{t},
$$

that is, $\pi_j$ can be viewed as the long-term proportion of time the process
spends in state $j$.

The balance equations for a continuous-time Markov chain take the form

$$
\pi_j \sum_{i=0}^{\infty} q_{ji} = \sum_{i=0}^{\infty} \pi_i q_{ij}, \qquad j = 0, 1, \ldots
$$


Similar to discrete-time Markov chains, it can be shown that there are two 
possibilities: 


(1) The steady-state probabilities are all positive and solve uniquely the
balance equations together with the normalization equation
$\pi_0 + \pi_1 + \cdots = 1$.


(2) The steady-state probabilities are all zero. 


To interpret the balance equations, we note that since $\pi_i$ is the proportion
of time the process spends in state $i$, it follows that $\pi_i q_{ij}$ can be
viewed as the frequency of transitions from $i$ to $j$ (expected number of
transitions from $i$ to $j$ per unit time). It is seen therefore that the
balance equations express the intuitive fact that the frequency of transitions
out of state $j$ (the left side term $\pi_j \sum_{i} q_{ji}$) is equal to the
frequency of transitions into state $j$ (the right side term
$\sum_{i=0}^{\infty} \pi_i q_{ij}$).

The continuous-time analog of the local balance equations for discrete-time
chains is

$$
\pi_j q_{ji} = \pi_i q_{ij}, \qquad i, j = 0, 1, \ldots
$$

These equations hold in birth-death systems where $q_{ij} = 0$ for
$|i - j| > 1$, but need not hold in other types of Markov chains. They express
the fact that the frequencies of transitions from $i$ to $j$ and from $j$ to $i$
are equal.

To understand the relationship between the balance equations for
continuous-time chains and the balance equations for discrete-time chains,
consider any $\delta > 0$, and the discrete-time Markov chain
$\{Z_n \mid n \ge 0\}$, where

$$
Z_n = X(n\delta), \qquad n = 0, 1, \ldots
$$

The steady-state distribution of $\{Z_n\}$ is clearly $\{\pi_j \mid j \ge 0\}$,
the steady-state distribution of the continuous chain. The transition
probabilities of $\{Z_n \mid n \ge 0\}$ can be derived by using the properties
of the exponential distribution. We obtain

$$
p_{ij} = \delta q_{ij} + o(\delta), \quad i \ne j, \qquad
p_{jj} = 1 - \delta \sum_{i \ne j} q_{ji} + o(\delta).
$$

Using these expressions in the balance equations

$$
\pi_j = \sum_{i=0}^{\infty} \pi_i p_{ij}, \qquad j \ge 0,
$$

for the discrete-time chain $\{Z_n\}$, we obtain

$$
\pi_j = \sum_{i=0}^{\infty} \pi_i p_{ij}
= \pi_j \Bigl( 1 - \delta \sum_{i \ne j} q_{ji} + o(\delta) \Bigr)
+ \sum_{i \ne j} \pi_i \bigl( \delta q_{ij} + o(\delta) \bigr).
$$

Canceling $\pi_j$ from both sides, dividing by $\delta$, and taking the limit as
$\delta \to 0$, we obtain the balance equations for the continuous-time chain.


Example 6.18. (The M/M/1 Queue, Continued) As in the case of a finite
number of states, the local balance equations are

$$
\pi_i \lambda = \pi_{i+1} \mu, \qquad i = 0, 1, \ldots,
$$

and we obtain $\pi_{i+1} = \rho \pi_i$, where $\rho = \lambda/\mu$. Thus, we
have $\pi_i = \rho^i \pi_0$ for all $i$. If $\rho < 1$, the normalization
equation $1 = \sum_{i=0}^{\infty} \pi_i$ yields

$$
1 = \pi_0 \sum_{i=0}^{\infty} \rho^i = \frac{\pi_0}{1 - \rho},
$$

in which case $\pi_0 = 1 - \rho$, and the steady-state probabilities are

$$
\pi_i = \rho^i (1 - \rho), \qquad i = 0, 1, \ldots
$$

If $\rho \ge 1$, which corresponds to the case where the arrival rate $\lambda$
is no less than the service rate $\mu$, the normalization equation
$1 = \pi_0 (1 + \rho + \rho^2 + \cdots)$ implies that $\pi_0 = 0$, and also
$\pi_i = \rho^i \pi_0 = 0$ for all $i$.
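
The geometric form of the M/M/1 steady-state probabilities is easy to check numerically. The following is a minimal sketch, not part of the original notes; the function name `mm1_steady_state` and the numerical rates are illustrative assumptions.

```python
def mm1_steady_state(lam, mu, n_states):
    """Return [pi_0, ..., pi_{n_states-1}] for an M/M/1 queue.

    lam is the arrival rate and mu the service rate (hypothetical inputs).
    """
    rho = lam / mu
    if rho >= 1:
        # For rho >= 1 all steady-state probabilities are zero.
        return [0.0] * n_states
    # pi_i = rho^i * (1 - rho)
    return [(1 - rho) * rho**i for i in range(n_states)]


if __name__ == "__main__":
    lam, mu = 2.0, 5.0
    pi = mm1_steady_state(lam, mu, n_states=60)
    print(pi[:3])
    # Local balance pi_i * lam = pi_{i+1} * mu should hold for every i.
    print(all(abs(pi[i] * lam - pi[i + 1] * mu) < 1e-12 for i in range(59)))
```

With $\rho = 0.4$, the first 60 terms already sum to 1 up to floating-point error, which is consistent with the normalization equation.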


Example 6.19. (The M/M/m and M/M/∞ Queues) The M/M/m queueing
system is identical to the M/M/1 system except that m packets can be
simultaneously transmitted (i.e., the transmission line of the node has m
transmission channels). A packet at the head of the queue is routed to any
channel that is available. The corresponding state transition diagram is shown
in Fig. 6.24.


Figure 6.24: Transition graph for the M/M/m queue (Example 6.19). Each arc
$(n-1, n)$ is labeled $\lambda$; the arc $(n, n-1)$ is labeled $n\mu$ for
$n \le m$ and $m\mu$ for $n > m$.


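By writing down the local balance equations for the steady-state probabilities
$\pi_n$, we obtain

$$
\lambda \pi_{n-1} = \begin{cases} n\mu\pi_n & \text{if } n \le m, \\ m\mu\pi_n & \text{if } n > m. \end{cases}
$$

Solving recursively, we obtain

$$
\pi_n = \begin{cases}
\pi_0 \dfrac{(m\rho)^n}{n!}, & n \le m, \\[2mm]
\pi_0 \dfrac{m^m \rho^n}{m!}, & n > m,
\end{cases}
$$

where $\rho$ is given by

$$
\rho = \frac{\lambda}{m\mu}.
$$

Assuming $\rho < 1$, we can calculate $\pi_0$ using the above equations and the
condition $\sum_{n=0}^{\infty} \pi_n = 1$. We obtain

$$
\pi_0 = \left[ 1 + \sum_{n=1}^{m-1} \frac{(m\rho)^n}{n!}
+ \sum_{n=m}^{\infty} \frac{(m\rho)^n}{m!} \frac{1}{m^{n-m}} \right]^{-1}
$$

and, finally,

$$
\pi_0 = \left[ \sum_{n=0}^{m-1} \frac{(m\rho)^n}{n!}
+ \frac{(m\rho)^m}{m!\,(1 - \rho)} \right]^{-1}.
$$

In the limiting case where $m = \infty$ in the M/M/m system (which is called
the M/M/∞ system), the local balance equations become

$$
\lambda \pi_{n-1} = n\mu\pi_n, \qquad n = 1, 2, \ldots,
$$

so

$$
\pi_n = \pi_0 \left( \frac{\lambda}{\mu} \right)^n \frac{1}{n!}, \qquad n = 1, 2, \ldots
$$

From the condition $\sum_{n=0}^{\infty} \pi_n = 1$, we obtain

$$
\pi_0 = \left[ 1 + \sum_{n=1}^{\infty} \left( \frac{\lambda}{\mu} \right)^n \frac{1}{n!} \right]^{-1} = e^{-\lambda/\mu},
$$

so, finally,

$$
\pi_n = \left( \frac{\lambda}{\mu} \right)^n \frac{e^{-\lambda/\mu}}{n!}, \qquad n = 0, 1, \ldots
$$

Therefore, in steady-state, the number in the system is Poisson distributed with
parameter $\lambda/\mu$.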
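
As a rough numerical check of the M/M/m and M/M/∞ formulas, one can evaluate them and confirm that the probabilities sum to 1. This is a sketch under stated assumptions: the function names `mmm_pi0`, `mmm_pi`, and `mminf_pi` are hypothetical and do not come from the notes.

```python
import math


def mmm_pi0(lam, mu, m):
    """pi_0 for an M/M/m queue, assuming rho = lam / (m * mu) < 1."""
    rho = lam / (m * mu)
    if rho >= 1:
        raise ValueError("requires rho = lam / (m * mu) < 1")
    a = m * rho  # equals lam / mu
    head = sum(a**n / math.factorial(n) for n in range(m))
    tail = a**m / (math.factorial(m) * (1 - rho))
    return 1.0 / (head + tail)


def mmm_pi(lam, mu, m, n):
    """pi_n for an M/M/m queue, using the two-case formula."""
    rho = lam / (m * mu)
    p0 = mmm_pi0(lam, mu, m)
    if n <= m:
        return p0 * (m * rho) ** n / math.factorial(n)
    return p0 * m**m * rho**n / math.factorial(m)


def mminf_pi(lam, mu, n):
    """pi_n for an M/M/infinity queue: Poisson with parameter lam / mu."""
    a = lam / mu
    return math.exp(-a) * a**n / math.factorial(n)


if __name__ == "__main__":
    # With m = 1 the M/M/m formula should reduce to the M/M/1 value 1 - rho.
    print(mmm_pi0(2.0, 5.0, 1))
    print(sum(mmm_pi(3.0, 2.0, 3, n) for n in range(200)))
```

Setting $m = 1$ recovers $\pi_0 = 1 - \rho$ from Example 6.18, which is a convenient sanity check on the closed form.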
Consider a sequence $X_1, X_2, \ldots$ of independent identically distributed
random variables with mean $\mu$ and variance $\sigma^2$. Let

$$
S_n = X_1 + \cdots + X_n
$$

be the sum of the first $n$ of them. Limit theorems are mostly concerned with
the properties of $S_n$ and related random variables, as $n$ becomes very large.

Because of independence, we have

$$
\mathrm{var}(S_n) = \mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n) = n\sigma^2.
$$

Thus, the distribution of $S_n$ spreads out as $n$ increases, and does not have
a meaningful limit. The situation is different if we consider the sample mean

$$
M_n = \frac{X_1 + \cdots + X_n}{n} = \frac{S_n}{n}.
$$

A quick calculation yields

$$
E[M_n] = \mu, \qquad \mathrm{var}(M_n) = \frac{\sigma^2}{n}.
$$

In particular, the variance of $M_n$ decreases to zero as $n$ increases, and the
bulk of its distribution must be very close to the mean $\mu$. This phenomenon
is the subject of certain laws of large numbers, which generally assert that the
sample mean $M_n$ (a random variable) converges to the true mean $\mu$ (a
number), in a precise sense. These laws provide a mathematical basis for the
loose interpretation of an expectation $E[X] = \mu$ as the average of a large
number of independent samples drawn from the distribution of $X$.

We will also consider a quantity which is intermediate between $S_n$ and $M_n$.
We first subtract $n\mu$ from $S_n$ to obtain the zero-mean random variable
$S_n - n\mu$ and then divide by $\sigma\sqrt{n}$, to obtain

$$
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.
$$

It can be verified (see Section 7.4) that

$$
E[Z_n] = 0, \qquad \mathrm{var}(Z_n) = 1.
$$

Since the mean and the variance of $Z_n$ remain unchanged as $n$ increases, its
distribution neither spreads, nor shrinks to a point. The central limit theorem
is concerned with the asymptotic shape of the distribution of $Z_n$ and asserts
that it becomes the standard normal distribution.

Limit theorems are useful for several reasons: 


(a) Conceptually, they provide an interpretation of expectations (as well as 
probabilities) in terms of a long sequence of identical independent experi- 
ments. 


(b) They allow for an approximate analysis of the properties of random
variables such as $S_n$. This is to be contrasted with an exact analysis which
would require a formula for the PMF or PDF of $S_n$, a complicated and
tedious task when $n$ is large.


7.1 SOME USEFUL INEQUALITIES


In this section, we derive some important inequalities. These inequalities use the 
mean, and possibly the variance, of a random variable to draw conclusions on 
the probabilities of certain events. They are primarily useful in situations where 
the mean and variance of a random variable X are easily computable, but the 
distribution of X is either unavailable or hard to calculate. 

We first present the Markov inequality. Loosely speaking it asserts that 
if a nonnegative random variable has a small mean, then the probability that it 
takes a large value must also be small. 


Markov Inequality 


If a random variable $X$ can only take nonnegative values, then

$$
P(X \ge a) \le \frac{E[X]}{a}, \qquad \text{for all } a > 0.
$$


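To justify the Markov inequality, let us fix a positive number $a$ and consider
the random variable $Y_a$, defined by

$$
Y_a = \begin{cases} 0, & \text{if } X < a, \\ a, & \text{if } X \ge a. \end{cases}
$$

It is seen that the relation

$$
Y_a \le X
$$

always holds and therefore,

$$
E[Y_a] \le E[X].
$$

On the other hand,

$$
E[Y_a] = a P(Y_a = a) = a P(X \ge a),
$$

from which we obtain

$$
a P(X \ge a) \le E[X].
$$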


Example 7.1. Let $X$ be uniformly distributed on the interval $[0, 4]$ and note
that $E[X] = 2$. Then, the Markov inequality asserts that

$$
P(X \ge 2) \le \frac{2}{2} = 1, \qquad
P(X \ge 3) \le \frac{2}{3} = 0.67, \qquad
P(X \ge 4) \le \frac{2}{4} = 0.5.
$$

By comparing with the exact probabilities

$$
P(X \ge 2) = 0.5, \qquad P(X \ge 3) = 0.25, \qquad P(X \ge 4) = 0,
$$

we see that the bounds provided by the Markov inequality can be quite loose.

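
The comparison in Example 7.1 can be reproduced with a few lines of code. This is an illustrative sketch only; the helper names `markov_bound` and `uniform_tail` are assumptions, and clipping the bound at 1 is a convenience added here, not part of the inequality itself.

```python
def markov_bound(mean, a):
    """Markov upper bound on P(X >= a) for a nonnegative X with the given mean."""
    if a <= 0:
        raise ValueError("a must be positive")
    # A probability never exceeds 1, so the bound is clipped there.
    return min(1.0, mean / a)


def uniform_tail(lo, hi, a):
    """Exact P(X >= a) when X is uniform on [lo, hi]."""
    if a <= lo:
        return 1.0
    if a >= hi:
        return 0.0
    return (hi - a) / (hi - lo)


if __name__ == "__main__":
    for a in (2, 3, 4):
        print(a, round(markov_bound(2.0, a), 2), uniform_tail(0.0, 4.0, a))
```
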
We continue with the Chebyshev inequality. Loosely speaking, it asserts 
that if the variance of a random variable is small, then the probability that it 


takes a value far from its mean is also small. Note that the Chebyshev inequality 
does not require the random variable to be nonnegative. 


Chebyshev Inequality 


If $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, then

$$
P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}, \qquad \text{for all } c > 0.
$$


To justify the Chebyshev inequality, we consider the nonnegative random
variable $(X - \mu)^2$ and apply the Markov inequality with $a = c^2$. We obtain

$$
P\bigl((X - \mu)^2 \ge c^2\bigr) \le \frac{E\bigl[(X - \mu)^2\bigr]}{c^2} = \frac{\sigma^2}{c^2}.
$$

The derivation is completed by observing that the event $(X - \mu)^2 \ge c^2$ is
identical to the event $|X - \mu| \ge c$ and

$$
P(|X - \mu| \ge c) = P\bigl((X - \mu)^2 \ge c^2\bigr) \le \frac{\sigma^2}{c^2}.
$$

An alternative form of the Chebyshev inequality is obtained by letting
$c = k\sigma$, where $k$ is positive, which yields

$$
P(|X - \mu| \ge k\sigma) \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.
$$

Thus, the probability that a random variable takes a value more than $k$
standard deviations away from its mean is at most $1/k^2$.

The Chebyshev inequality is generally more powerful than the Markov in- 
equality (the bounds that it provides are more accurate), because it also makes 
use of information on the variance of X. Still, the mean and the variance of 
a random variable are only a rough summary of the properties of its distribu- 
tion, and we cannot expect the bounds to be close approximations of the exact 
probabilities. 




Example 7.2. As in Example 7.1, let $X$ be uniformly distributed on $[0, 4]$.
Let us use the Chebyshev inequality to bound the probability that
$|X - 2| \ge 1$. We have $\sigma^2 = 16/12 = 4/3$, and

$$
P(|X - 2| \ge 1) \le \frac{4}{3},
$$

which is not particularly informative.

For another example, let $X$ be exponentially distributed with parameter
$\lambda = 1$, so that $E[X] = \mathrm{var}(X) = 1$. For $c > 1$, using
Chebyshev’s inequality, we obtain

$$
P(X \ge c) = P(X - 1 \ge c - 1) \le P(|X - 1| \ge c - 1) \le \frac{1}{(c - 1)^2}.
$$

This is again conservative compared to the exact answer $P(X \ge c) = e^{-c}$.
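
To see how conservative the Chebyshev bound is for the exponential case, one can tabulate the bound next to the exact tail. The sketch below is illustrative; the function names are hypothetical, and the bound is clipped at 1 for convenience.

```python
import math


def chebyshev_bound(variance, c):
    """Chebyshev upper bound on P(|X - mean| >= c), clipped at 1."""
    if c <= 0:
        raise ValueError("c must be positive")
    return min(1.0, variance / c**2)


def exponential_tail_comparison(c):
    """Return (Chebyshev bound, exact value) for P(X >= c), X exponential(1), c > 1."""
    return chebyshev_bound(1.0, c - 1.0), math.exp(-c)


if __name__ == "__main__":
    for c in (2.0, 3.0, 5.0, 10.0):
        bound, exact = exponential_tail_comparison(c)
        print(c, bound, exact)
```

The bound decays like $1/c^2$ while the exact tail decays exponentially, so the gap widens quickly as $c$ grows.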


7.2 THE WEAK LAW OF LARGE NUMBERS 


The weak law of large numbers asserts that the sample mean of a large number 
of independent identically distributed random variables is very close to the true 
mean, with high probability. 

As in the introduction to this chapter, we consider a sequence
$X_1, X_2, \ldots$ of independent identically distributed random variables with
mean $\mu$ and variance $\sigma^2$, and define the sample mean by

$$
M_n = \frac{X_1 + \cdots + X_n}{n}.
$$

We have

$$
E[M_n] = \frac{E[X_1] + \cdots + E[X_n]}{n} = \frac{n\mu}{n} = \mu,
$$

and, using independence,

$$
\mathrm{var}(M_n) = \frac{\mathrm{var}(X_1 + \cdots + X_n)}{n^2}
= \frac{\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)}{n^2}
= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
$$

We apply Chebyshev’s inequality and obtain

$$
P(|M_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}, \qquad \text{for any } \epsilon > 0.
$$

We observe that for any fixed $\epsilon > 0$, the right-hand side of this
inequality goes to zero as $n$ increases. As a consequence, we obtain the weak
law of large numbers, which is stated below. It turns out that this law remains
true even if the $X_i$ have infinite variance, but a much more elaborate
argument is needed, which we omit. The only assumption needed is that $E[X_i]$
is well-defined and finite.


The Weak Law of Large Numbers (WLLN) 


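Let $X_1, X_2, \ldots$ be independent identically distributed random variables
with mean $\mu$. For every $\epsilon > 0$, we have

$$
P(|M_n - \mu| \ge \epsilon)
= P\left( \left| \frac{X_1 + \cdots + X_n}{n} - \mu \right| \ge \epsilon \right) \to 0,
\qquad \text{as } n \to \infty.
$$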


The WLLN states that for large $n$, the “bulk” of the distribution of $M_n$ is
concentrated near $\mu$. That is, if we consider a positive length interval
$[\mu - \epsilon, \mu + \epsilon]$ around $\mu$, then there is high probability
that $M_n$ will fall in that interval; as $n \to \infty$, this probability
converges to 1. Of course, if $\epsilon$ is very small, we may have to wait
longer (i.e., need a larger value of $n$) before we can assert that $M_n$ is
highly likely to fall in that interval.


Example 7.3. Probabilities and Frequencies. Consider an event $A$ defined
in the context of some probabilistic experiment. Let $p = P(A)$ be the
probability of that event. We consider $n$ independent repetitions of the
experiment, and let $M_n$ be the fraction of time that event $A$ occurred; in
this context, $M_n$ is often called the empirical frequency of $A$. Note that

$$
M_n = \frac{X_1 + \cdots + X_n}{n},
$$

where $X_i$ is 1 whenever $A$ occurs, and 0 otherwise; in particular,
$E[X_i] = p$. The weak law applies and shows that when $n$ is large, the
empirical frequency is most likely to be within $\epsilon$ of $p$. Loosely
speaking, this allows us to say that empirical frequencies are faithful
estimates of $p$. Alternatively, this is a step towards interpreting the
probability $p$ as the frequency of occurrence of $A$.


Example 7.4. Polling. Let $p$ be the fraction of voters who support a
particular candidate for office. We interview $n$ “randomly selected” voters and
record the fraction $M_n$ of them that support the candidate. We view $M_n$ as
our estimate of $p$ and would like to investigate its properties.

We interpret “randomly selected” to mean that the $n$ voters are chosen
independently and uniformly from the given population. Thus, the reply of each
person interviewed can be viewed as an independent Bernoulli trial $X_i$ with
success probability $p$ and variance $\sigma^2 = p(1 - p)$. The Chebyshev
inequality yields

$$
P(|M_n - p| \ge \epsilon) \le \frac{p(1 - p)}{n\epsilon^2}.
$$

The true value of the parameter $p$ is assumed to be unknown. On the other hand,
it is easily verified that $p(1 - p) \le 1/4$, which yields

$$
P(|M_n - p| \ge \epsilon) \le \frac{1}{4n\epsilon^2}.
$$

For example, if $\epsilon = 0.1$ and $n = 100$, we obtain

$$
P(|M_{100} - p| \ge 0.1) \le \frac{1}{4 \cdot 100 \cdot (0.1)^2} = 0.25.
$$

In words, with a sample size of $n = 100$, the probability that our estimate is
wrong by more than 0.1 is no larger than 0.25.

Suppose now that we impose some tight specifications on our poll. We would
like to have high confidence (probability at least 95%) that our estimate will
be very accurate (within 0.01 of $p$). How many voters should be sampled?

The only guarantee that we have at this point is the inequality

$$
P(|M_n - p| \ge 0.01) \le \frac{1}{4n(0.01)^2}.
$$

We will be sure to satisfy the above specifications if we choose $n$ large
enough so that

$$
\frac{1}{4n(0.01)^2} \le 1 - 0.95 = 0.05,
$$

which yields $n \ge 50{,}000$. This choice of $n$ has the specified properties
but is actually fairly conservative, because it is based on the rather loose
Chebyshev inequality. A refinement will be considered in Section 7.4.
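
The sample-size calculation of Example 7.4 can be sketched as follows. The function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is a design choice made here so that rounding error does not affect the ceiling; it is not something the notes prescribe.

```python
import math
from fractions import Fraction


def chebyshev_poll_bound(n, eps):
    """Upper bound 1 / (4 n eps^2) on P(|M_n - p| >= eps), valid for any p."""
    eps = Fraction(str(eps))
    return Fraction(1, 4 * n) / eps**2


def chebyshev_sample_size(eps, delta):
    """Smallest n for which 1 / (4 n eps^2) <= delta."""
    eps, delta = Fraction(str(eps)), Fraction(str(delta))
    return math.ceil(1 / (4 * delta * eps**2))


if __name__ == "__main__":
    print(float(chebyshev_poll_bound(100, 0.1)))   # bound for n = 100, eps = 0.1
    print(chebyshev_sample_size(0.01, 0.05))       # accuracy 0.01, 95% confidence
```
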


7.3 CONVERGENCE IN PROBABILITY 


We can interpret the WLLN as stating that “$M_n$ converges to $\mu$.” However,
since $M_1, M_2, \ldots$ is a sequence of random variables, not a sequence of numbers,
the meaning of convergence has to be made precise. A particular definition 
is provided below. To facilitate the comparison with the ordinary notion of 
convergence, we also include the definition of the latter. 


Convergence of a Deterministic Sequence 


Let $a_1, a_2, \ldots$ be a sequence of real numbers, and let $a$ be another
real number. We say that the sequence $a_n$ converges to $a$, or
$\lim_{n \to \infty} a_n = a$, if for every $\epsilon > 0$ there exists some
$n_0$ such that

$$
|a_n - a| \le \epsilon, \qquad \text{for all } n \ge n_0.
$$

Intuitively, for any given accuracy level $\epsilon$, $a_n$ must be within
$\epsilon$ of $a$, when $n$ is large enough.


Convergence in Probability 


Let $Y_1, Y_2, \ldots$ be a sequence of random variables (not necessarily
independent), and let $a$ be a real number. We say that the sequence $Y_n$
converges to $a$ in probability, if for every $\epsilon > 0$, we have

$$
\lim_{n \to \infty} P(|Y_n - a| \ge \epsilon) = 0.
$$


Given this definition, the WLLN simply says that the sample mean converges in
probability to the true mean $\mu$.

If the random variables $Y_1, Y_2, \ldots$ have a PMF or a PDF and converge in
probability to $a$, then according to the above definition, “almost all” of the
PMF or PDF of $Y_n$ is concentrated within an $\epsilon$-interval around $a$
for large values of $n$. It is also instructive to rephrase the above
definition as follows: for every $\epsilon > 0$, and for every $\delta > 0$,
there exists some $n_0$ such that

$$
P(|Y_n - a| \ge \epsilon) \le \delta, \qquad \text{for all } n \ge n_0.
$$

If we refer to $\epsilon$ as the accuracy level, and $\delta$ as the confidence
level, the definition takes the following intuitive form: for any given level of
accuracy and confidence, $Y_n$ will be equal to $a$, within these levels of
accuracy and confidence, provided that $n$ is large enough.


Example 7.5. Consider a sequence of independent random variables $X_n$ that are
uniformly distributed over the interval $[0, 1]$, and let

$$
Y_n = \min\{X_1, \ldots, X_n\}.
$$

The sequence of values of $Y_n$ cannot increase as $n$ increases, and it will
occasionally decrease (when a value of $X_n$ that is smaller than the preceding
values is obtained). Thus, we intuitively expect that $Y_n$ converges to zero.
Indeed, for $0 < \epsilon \le 1$, we have using the independence of the $X_n$,

$$
\begin{aligned}
P(|Y_n - 0| \ge \epsilon) &= P(X_1 \ge \epsilon, \ldots, X_n \ge \epsilon) \\
&= P(X_1 \ge \epsilon) \cdots P(X_n \ge \epsilon) \\
&= (1 - \epsilon)^n.
\end{aligned}
$$

The right-hand side goes to zero as $n \to \infty$, and for $\epsilon > 1$ the
probability is zero for every $n$. Since this is true for every
$\epsilon > 0$, we conclude that $Y_n$ converges to zero, in probability.


Example 7.6. Let $Y$ be an exponentially distributed random variable with
parameter $\lambda = 1$. For any positive integer $n$, let $Y_n = Y/n$. (Note
that these random variables are dependent.) We wish to investigate whether the
sequence $Y_n$ converges to zero.

For $\epsilon > 0$, we have

$$
P(|Y_n - 0| \ge \epsilon) = P(Y_n \ge \epsilon) = P(Y \ge n\epsilon) = e^{-n\epsilon}.
$$

In particular,

$$
\lim_{n \to \infty} P(|Y_n - 0| \ge \epsilon) = \lim_{n \to \infty} e^{-n\epsilon} = 0.
$$

Since this is the case for every $\epsilon > 0$, $Y_n$ converges to zero, in
probability.
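
The accuracy and confidence reading of convergence in probability can be made concrete for Examples 7.5 and 7.6: given $\epsilon$ and $\delta$, find the first $n$ at which the tail probability drops below $\delta$. The sketch below is illustrative; all function names, and the search cap `n_max`, are assumptions introduced here.

```python
import math


def min_uniform_tail(n, eps):
    """P(min{X_1, ..., X_n} >= eps) for independent Uniform[0, 1] X_i (Example 7.5)."""
    return max(0.0, 1.0 - eps) ** n


def scaled_exponential_tail(n, eps):
    """P(Y / n >= eps) for Y exponential with parameter 1 (Example 7.6)."""
    return math.exp(-n * eps)


def smallest_n(tail, eps, delta, n_max=10**6):
    """Smallest n with tail(n, eps) <= delta, found by direct search."""
    for n in range(1, n_max + 1):
        if tail(n, eps) <= delta:
            return n
    raise ValueError("no n <= n_max meets the confidence level")


if __name__ == "__main__":
    print(smallest_n(min_uniform_tail, eps=0.1, delta=0.01))
    print(smallest_n(scaled_exponential_tail, eps=0.1, delta=0.01))
```

Both tails here are decreasing in $n$, so the first $n$ that meets the confidence level also works for every larger $n$, which is what the definition requires.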


One might be tempted to believe that if a sequence $Y_n$ converges to a
number $a$, then $E[Y_n]$ must also converge to $a$. The following example shows
that this need not be the case.


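Example 7.7. Consider a sequence of discrete random variables $Y_n$ with the
following distribution:

$$
P(Y_n = y) = \begin{cases}
1 - \dfrac{1}{n}, & \text{for } y = 0, \\[2mm]
\dfrac{1}{n}, & \text{for } y = n^2, \\[2mm]
0, & \text{elsewhere.}
\end{cases}
$$

For every $\epsilon > 0$, we have

$$
\lim_{n \to \infty} P(|Y_n| \ge \epsilon) = \lim_{n \to \infty} \frac{1}{n} = 0,
$$

and $Y_n$ converges to zero in probability. On the other hand,
$E[Y_n] = n^2/n = n$, which goes to infinity as $n$ increases.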
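
A short tabulation makes the contrast in Example 7.7 visible: the probability of a nonzero value shrinks while the mean grows. This is a minimal sketch; the function name `example_7_7` is hypothetical.

```python
from fractions import Fraction


def example_7_7(n):
    """Return (P(Y_n != 0), E[Y_n]) for the PMF of Example 7.7."""
    p_nonzero = Fraction(1, n)       # P(Y_n = n^2)
    mean = p_nonzero * n**2          # 0 * (1 - 1/n) + n^2 * (1/n)
    return p_nonzero, mean


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        p, m = example_7_7(n)
        print(n, float(p), int(m))
```
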


7.4 THE CENTRAL LIMIT THEOREM 


According to the weak law of large numbers, the distribution of the sample
mean $M_n$ is increasingly concentrated in the near vicinity of the true mean
$\mu$. In particular, its variance tends to zero. On the other hand, the
variance of the sum $S_n = X_1 + \cdots + X_n = nM_n$ increases to infinity, and
the distribution of $S_n$ cannot be said to converge to anything meaningful. An
intermediate view is obtained by considering the deviation $S_n - n\mu$ of $S_n$
from its mean $n\mu$, and scaling it by a factor proportional to $1/\sqrt{n}$.
What is special about this particular scaling is that it keeps the variance at a
constant level. The central limit theorem asserts that the distribution of this
scaled random variable approaches a normal distribution.

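More specifically, let $X_1, X_2, \ldots$ be a sequence of independent
identically distributed random variables with mean $\mu$ and variance
$\sigma^2$. We define

$$
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}.
$$

An easy calculation yields

$$
E[Z_n] = \frac{E[X_1 + \cdots + X_n] - n\mu}{\sigma\sqrt{n}} = 0,
$$

and

$$
\mathrm{var}(Z_n) = \frac{\mathrm{var}(X_1 + \cdots + X_n)}{\sigma^2 n}
= \frac{\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)}{\sigma^2 n}
= \frac{n\sigma^2}{n\sigma^2} = 1.
$$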


The Central Limit Theorem 


Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed
random variables with common mean $\mu$ and variance $\sigma^2$, and define

$$
Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}.
$$

Then, the CDF of $Z_n$ converges to the standard normal CDF

$$
\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\, dx,
$$

in the sense that

$$
\lim_{n \to \infty} P(Z_n \le z) = \Phi(z), \qquad \text{for every } z.
$$


The central limit theorem is surprisingly general. Besides independence,
and the implicit assumption that the mean and variance are well-defined and
finite, it places no other requirement on the distribution of the $X_i$, which
could be discrete, continuous, or mixed random variables. It is of tremendous
importance for several reasons, both conceptual and practical. On the
conceptual side, it indicates that the sum of a large number of independent
random variables is approximately normal. As such, it applies to many situations
in which a random effect is the sum of a large number of small but independent
random factors. Noise in many natural or engineered systems has this property.
In a wide array of contexts, it has been found empirically that the statistics
of noise are well-described by normal distributions, and the central limit
theorem provides a convincing explanation for this phenomenon.

On the practical side, the central limit theorem eliminates the need for 
detailed probabilistic models and for tedious manipulations of PMFs and PDFs. 
Rather, it allows the calculation of certain probabilities by simply referring to the 
normal CDF table. Furthermore, these calculations only require the knowledge 
of means and variances. 


Approximations Based on the Central Limit Theorem 


The central limit theorem allows us to calculate probabilities related to $Z_n$
as if $Z_n$ were normal. Since normality is preserved under linear
transformations, this is equivalent to treating $S_n$ as a normal random
variable with mean $n\mu$ and variance $n\sigma^2$.


Normal Approximation Based on the Central Limit Theorem 


Let $S_n = X_1 + \cdots + X_n$, where the $X_i$ are independent identically
distributed random variables with mean $\mu$ and variance $\sigma^2$. If $n$ is
large, the probability $P(S_n \le c)$ can be approximated by treating $S_n$ as
if it were normal, according to the following procedure.

1. Calculate the mean $n\mu$ and the variance $n\sigma^2$ of $S_n$.
2. Calculate the normalized value $z = (c - n\mu)/(\sigma\sqrt{n})$.
3. Use the approximation

$$
P(S_n \le c) \approx \Phi(z),
$$

where $\Phi(z)$ is available from standard normal CDF tables.


Example 7.8. We load on a plane 100 packages whose weights are independent
random variables that are uniformly distributed between 5 and 50 pounds. What is
the probability that the total weight will exceed 3000 pounds? It is not easy to
calculate the CDF of the total weight and the desired probability, but an
approximate answer can be quickly obtained using the central limit theorem.

We want to calculate $P(S_{100} > 3000)$, where $S_{100}$ is the sum of the
weights of the 100 packages. The mean and the variance of the weight of a single
package are

$$
\mu = \frac{5 + 50}{2} = 27.5, \qquad \sigma^2 = \frac{(50 - 5)^2}{12} = 168.75,
$$

based on the formulas for the mean and variance of the uniform PDF. We thus
calculate the normalized value

$$
z = \frac{3000 - 100 \cdot 27.5}{\sqrt{168.75 \cdot 100}} = \frac{250}{129.9} = 1.92,
$$

and use the standard normal tables to obtain the approximation

$$
P(S_{100} \le 3000) \approx \Phi(1.92) = 0.9726.
$$

Thus the desired probability is

$$
P(S_{100} > 3000) = 1 - P(S_{100} \le 3000) \approx 1 - 0.9726 = 0.0274.
$$
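
The three-step normal approximation can be sketched in code, using the error function in place of a normal table. This is an illustrative sketch; the function names `std_normal_cdf` and `clt_cdf_approx` are assumptions, and the result differs slightly from the table-based value because $z$ is not rounded to two decimals.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_cdf_approx(c, n, mean, variance):
    """Approximate P(S_n <= c) for a sum of n i.i.d. terms with given mean and variance."""
    z = (c - n * mean) / math.sqrt(n * variance)
    return std_normal_cdf(z)


if __name__ == "__main__":
    # Example 7.8: 100 package weights, each uniform on [5, 50].
    mean = (5 + 50) / 2
    variance = (50 - 5) ** 2 / 12
    print(1.0 - clt_cdf_approx(3000, 100, mean, variance))
```
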


Example 7.9. A machine processes parts, one at a time. The processing times
of different parts are independent random variables, uniformly distributed on
$[1, 5]$. We wish to approximate the probability that the number of parts
processed within 320 time units is at least 100.

Let us call $N_{320}$ this number. We want to calculate
$P(N_{320} \ge 100)$. There is no obvious way of expressing the random variable
$N_{320}$ as the sum of independent random variables, but we can proceed
differently. Let $X_i$ be the processing time of the $i$th part, and let
$S_{100} = X_1 + \cdots + X_{100}$ be the total processing time of the first 100
parts. The event $\{N_{320} \ge 100\}$ is the same as the event
$\{S_{100} \le 320\}$, and we can now use a normal approximation to the
distribution of $S_{100}$. Note that $\mu = E[X_i] = 3$ and
$\sigma^2 = \mathrm{var}(X_i) = 16/12 = 4/3$. We calculate the normalized value

$$
z = \frac{320 - n\mu}{\sigma\sqrt{n}} = \frac{320 - 300}{\sqrt{100 \cdot 4/3}} = 1.73,
$$

and use the approximation

$$
P(S_{100} \le 320) \approx \Phi(1.73) = 0.9582.
$$


If the variance of the X; is unknown, but an upper bound is available, 


the normal approximation can be used to obtain bounds on the probabilities of 
interest. 


Example 7.10. Let us revisit the polling problem in Example 7.4. We poll $n$
voters and record the fraction $M_n$ of those polled who are in favor of a
particular candidate. If $p$ is the fraction of the entire voter population that
supports this candidate, then

$$
M_n = \frac{X_1 + \cdots + X_n}{n},
$$

where the $X_i$ are independent Bernoulli random variables with parameter $p$.
In particular, $M_n$ has mean $p$ and variance $p(1 - p)/n$. By the normal
approximation, $X_1 + \cdots + X_n$ is approximately normal, and therefore
$M_n$ is also approximately normal.

We are interested in the probability $P(|M_n - p| \ge \epsilon)$ that the
polling error is larger than some desired accuracy $\epsilon$. Because of the
symmetry of the normal PDF around the mean, we have

$$
P(|M_n - p| \ge \epsilon) \approx 2 P(M_n - p \ge \epsilon).
$$

The variance $p(1 - p)/n$ of $M_n - p$ depends on $p$ and is therefore unknown.
We note that the probability of a large deviation from the mean increases with
the variance. Thus, we can obtain an upper bound on
$P(M_n - p \ge \epsilon)$ by assuming that $M_n - p$ has the largest possible
variance, namely, $1/(4n)$. To calculate this upper bound, we evaluate the
standardized value

$$
z = \frac{\epsilon}{1/(2\sqrt{n})},
$$

and use the normal approximation

$$
P(M_n - p \ge \epsilon) \le 1 - \Phi(z) = 1 - \Phi(2\epsilon\sqrt{n}).
$$

For instance, consider the case where $n = 100$ and $\epsilon = 0.1$. Assuming
the worst-case variance, we obtain

$$
\begin{aligned}
P(|M_{100} - p| \ge 0.1) &\approx 2 P(M_{100} - p \ge 0.1) \\
&\le 2 - 2\Phi(2 \cdot 0.1 \cdot \sqrt{100}) = 2 - 2\Phi(2) = 2 - 2 \cdot 0.977 = 0.046.
\end{aligned}
$$

This is much smaller (more accurate) than the estimate that was obtained in
Example 7.4 using the Chebyshev inequality.

We now consider a reverse problem. How large a sample size $n$ is needed
if we wish our estimate $M_n$ to be within 0.01 of $p$ with probability at least
0.95? Assuming again the worst possible variance, we are led to the condition

$$
2 - 2\Phi(2 \cdot 0.01 \cdot \sqrt{n}) \le 0.05,
$$

or

$$
\Phi(2 \cdot 0.01 \cdot \sqrt{n}) \ge 0.975.
$$

From the normal tables, we see that $\Phi(1.96) = 0.975$, which leads to

$$
2 \cdot 0.01 \cdot \sqrt{n} \ge 1.96,
$$

or

$$
n \ge \frac{(1.96)^2}{4 \cdot (0.01)^2} = 9604.
$$

This is significantly better than the sample size of 50,000 that we found using
Chebyshev’s inequality.
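
The worst-case normal-approximation bound and the resulting sample size can be sketched with Python's standard library, which provides the normal CDF and its inverse through `statistics.NormalDist`. The function names below are hypothetical, and the quantile is computed numerically rather than read from a table.

```python
import math
from statistics import NormalDist


def clt_poll_bound(n, eps):
    """Worst-case normal-approximation bound 2 - 2*Phi(2*eps*sqrt(n))."""
    return 2.0 - 2.0 * NormalDist().cdf(2.0 * eps * math.sqrt(n))


def clt_sample_size(eps, confidence):
    """Smallest n with 2 - 2*Phi(2*eps*sqrt(n)) <= 1 - confidence."""
    # Two-sided requirement: Phi(2*eps*sqrt(n)) >= 1 - (1 - confidence) / 2.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z / (2.0 * eps)) ** 2)


if __name__ == "__main__":
    print(clt_poll_bound(100, 0.1))
    print(clt_sample_size(0.01, 0.95))
```

Keep in mind that this is an approximation rather than a guaranteed bound: it relies on $M_n$ being close to normal, whereas the Chebyshev-based sample size holds for every $n$.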


The normal approximation is increasingly accurate as $n$ tends to infinity,
but in practice we are generally faced with specific and finite values of $n$.
It would be useful to know how large an $n$ is needed before the approximation
can be trusted, but there are no simple and general guidelines. Much depends
on whether the distribution of the $X_i$ is close to normal to start with and,
in particular, whether it is symmetric. For example, if the $X_i$ are uniform,
then $S_8$ is already very close to normal. But if the $X_i$ are, say,
exponential, a significantly larger $n$ will be needed before the distribution
of $S_n$ is close to a normal one. Furthermore, the normal approximation to
$P(S_n \le c)$ is generally more faithful when $c$ is in the vicinity of the
mean of $S_n$.


The De Moivre — Laplace Approximation to the Binomial 


A binomial random variable $S_n$ with parameters $n$ and $p$ can be viewed as
the sum of $n$ independent Bernoulli random variables $X_1, \ldots, X_n$, with
common parameter $p$:

$$
S_n = X_1 + \cdots + X_n.
$$

Recall that

$$
\mu = E[X_i] = p, \qquad \sigma = \sqrt{\mathrm{var}(X_i)} = \sqrt{p(1 - p)}.
$$

We will now use the approximation suggested by the central limit theorem to
provide an approximation for the probability of the event
$\{k \le S_n \le \ell\}$, where $k$ and $\ell$ are given integers. We express
the event of interest in terms of a standardized random variable, using the
equivalence

$$
k \le S_n \le \ell \iff
\frac{k - np}{\sqrt{np(1 - p)}} \le \frac{S_n - np}{\sqrt{np(1 - p)}} \le \frac{\ell - np}{\sqrt{np(1 - p)}}.
$$

By the central limit theorem, $(S_n - np)/\sqrt{np(1 - p)}$ has approximately a
standard normal distribution, and we obtain

$$
\begin{aligned}
P(k \le S_n \le \ell)
&= P\left( \frac{k - np}{\sqrt{np(1 - p)}} \le \frac{S_n - np}{\sqrt{np(1 - p)}} \le \frac{\ell - np}{\sqrt{np(1 - p)}} \right) \\
&\approx \Phi\left( \frac{\ell - np}{\sqrt{np(1 - p)}} \right) - \Phi\left( \frac{k - np}{\sqrt{np(1 - p)}} \right).
\end{aligned}
$$

An approximation of this form is equivalent to treating $S_n$ as a normal
random variable with mean $np$ and variance $np(1 - p)$. Figure 7.1 provides an
illustration and indicates that a more accurate approximation may be possible if
we replace $k$ and $\ell$ by $k - \frac{1}{2}$ and $\ell + \frac{1}{2}$,
respectively. The corresponding formula is given below.


Figure 7.1: The central limit approximation treats a binomial random variable
$S_n$ as if it were normal with mean $np$ and variance $np(1 - p)$. This figure
shows a binomial PMF together with the approximating normal PDF. (a) A first
approximation of a binomial probability $P(k \le S_n \le \ell)$ is obtained by
integrating the area under the normal PDF from $k$ to $\ell$, which is the
shaded area in the figure. (b) With the approach in (a), if we have
$k = \ell$, the probability $P(S_n = k)$ would be approximated by zero. A
potential remedy would be to use the normal probability between
$k - \frac{1}{2}$ and $k + \frac{1}{2}$ to approximate $P(S_n = k)$. By
extending this idea, $P(k \le S_n \le \ell)$ can be approximated by using the
area under the normal PDF from $k - \frac{1}{2}$ to $\ell + \frac{1}{2}$, which
corresponds to the shaded area.


De Moivre — Laplace Approximation to the Binomial 


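If $S_n$ is a binomial random variable with parameters $n$ and $p$, $n$ is
large, and $k$, $\ell$ are nonnegative integers, then

$$
P(k \le S_n \le \ell) \approx
\Phi\left( \frac{\ell + \frac{1}{2} - np}{\sqrt{np(1 - p)}} \right)
- \Phi\left( \frac{k - \frac{1}{2} - np}{\sqrt{np(1 - p)}} \right).
$$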


Example 7.11. Let $S_n$ be a binomial random variable with parameters $n = 36$
and $p = 0.5$. An exact calculation yields

$$
P(S_n \le 21) = \sum_{k=0}^{21} \binom{36}{k} (0.5)^{36} = 0.8785.
$$

The central limit approximation, without the above discussed refinement, yields

$$
P(S_n \le 21) \approx \Phi\left( \frac{21 - np}{\sqrt{np(1 - p)}} \right)
= \Phi\left( \frac{21 - 18}{3} \right) = \Phi(1) = 0.8413.
$$

Using the proposed refinement, we have

$$
P(S_n \le 21) \approx \Phi\left( \frac{21.5 - np}{\sqrt{np(1 - p)}} \right)
= \Phi\left( \frac{21.5 - 18}{3} \right) = \Phi(1.17) = 0.879,
$$

which is much closer to the exact value.

The de Moivre-Laplace formula also allows us to approximate the probability
of a single value. For example,

$$
P(S_n = 19) \approx \Phi\left( \frac{19.5 - 18}{3} \right) - \Phi\left( \frac{18.5 - 18}{3} \right)
= 0.6915 - 0.5675 = 0.124.
$$

This is very close to the exact value which is

$$
\binom{36}{19} (0.5)^{36} = 0.1251.
$$
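
The numbers in Example 7.11 can be checked by computing the exact binomial probabilities and the corrected normal approximation side by side. This is a sketch under stated assumptions: the function names are hypothetical, and $\Phi$ is evaluated through the error function rather than a table, so the approximate values differ from the text in the third decimal place.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binomial_prob(n, p, k, l):
    """Exact P(k <= S_n <= l) for S_n binomial with parameters n and p."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, l + 1))


def de_moivre_laplace(n, p, k, l):
    """Normal approximation to P(k <= S_n <= l) with the 1/2 correction."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    upper = std_normal_cdf((l + 0.5 - mean) / std)
    lower = std_normal_cdf((k - 0.5 - mean) / std)
    return upper - lower


if __name__ == "__main__":
    print(binomial_prob(36, 0.5, 0, 21), de_moivre_laplace(36, 0.5, 0, 21))
    print(binomial_prob(36, 0.5, 19, 19), de_moivre_laplace(36, 0.5, 19, 19))
```

For $P(S_n = 19)$ the unrounded approximation comes out near 0.125, slightly above the 0.124 obtained from two-decimal table lookups.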


7.5 THE STRONG LAW OF LARGE NUMBERS 


The strong law of large numbers is similar to the weak law in that it also deals 
with the convergence of the sample mean to the true mean. It is different, 
however, because it refers to another type of convergence. 


The Strong Law of Large Numbers (SLLN) 


Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed
random variables with mean $\mu$. Then, the sequence of sample means
$M_n = (X_1 + \cdots + X_n)/n$ converges to $\mu$, with probability 1, in the
sense that

$$
P\left( \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu \right) = 1.
$$


In order to interpret the SLLN, we need to go back to our original
description of probabilistic models in terms of sample spaces. The contemplated
experiment is infinitely long and generates experimental values for each one of
the random variables in the sequence $X_1, X_2, \ldots$. Thus, it is best to
think of the sample space $\Omega$ as a set of infinite sequences
$\omega = (x_1, x_2, \ldots)$ of real numbers: any such sequence is a possible
outcome of the experiment. Let us now define the subset $A$ of $\Omega$
consisting of those sequences $(x_1, x_2, \ldots)$ whose long-term average is
$\mu$, i.e.,

$$
(x_1, x_2, \ldots) \in A \iff \lim_{n \to \infty} \frac{x_1 + \cdots + x_n}{n} = \mu.
$$

The SLLN states that all of the probability is concentrated on this particular
subset of $\Omega$. Equivalently, the collection of outcomes that do not belong
to $A$ (infinite sequences whose long-term average is not $\mu$) has probability
zero.

The difference between the weak and the strong law is subtle and deserves
close scrutiny. The weak law states that the probability
$P(|M_n - \mu| \ge \epsilon)$ of a significant deviation of $M_n$ from $\mu$
goes to zero as $n \to \infty$. Still, for any finite $n$, this probability can
be positive and it is conceivable that once in a while, even if infrequently,
$M_n$ deviates significantly from $\mu$. The weak law provides no conclusive
information on the number of such deviations, but the strong law does. According
to the strong law, and with probability 1, $M_n$ converges to $\mu$. This
implies that for any given $\epsilon > 0$, the difference $|M_n - \mu|$ will
exceed $\epsilon$ only a finite number of times.


Example 7.12. Probabilities and Frequencies. As in Example 7.3, consider
an event $A$ defined in terms of some probabilistic experiment. We consider
a sequence of independent repetitions of the same experiment, and let $M_n$ be the
fraction of the first $n$ trials in which $A$ occurs. The strong law of large numbers
asserts that $M_n$ converges to $P(A)$, with probability 1.

We have often talked intuitively about the probability of an event A as the 
frequency with which it occurs in an infinitely long sequence of independent trials. 
The strong law backs this intuition and establishes that the long-term frequency 
of occurrence of A is indeed equal to P(A), with certainty (the probability of this 
happening is 1). 


Convergence with Probability 1 


The convergence concept behind the strong law is different from the notion
employed in the weak law. We provide here a definition and some discussion of this
new convergence concept.


Convergence with Probability 1 


Let $Y_1, Y_2, \ldots$ be a sequence of random variables (not necessarily independent)
associated with the same probability model. Let $c$ be a real number.
We say that $Y_n$ converges to $c$ with probability 1 (or almost surely) if


$$P\left(\lim_{n\to\infty} Y_n = c\right) = 1.$$


Similar to our earlier discussion, the right way of interpreting this type of 
convergence is in terms of a sample space consisting of infinite sequences: all 
of the probability is concentrated on those sequences that converge to c. This 
does not mean that other sequences are impossible, only that they are extremely 
unlikely, in the sense that their total probability is zero. 

The example below illustrates the difference between convergence in probability
and convergence with probability 1.


Example 7.13. Consider a discrete-time arrival process. The set of times is
partitioned into consecutive intervals of the form $I_k = \{2^k, 2^k + 1, \ldots, 2^{k+1} - 1\}$.
Note that the length of $I_k$ is $2^k$, which increases with $k$. During each interval $I_k$,
there is exactly one arrival, and all times within an interval are equally likely. The
arrival times within different intervals are assumed to be independent. Let us define
$Y_n = 1$ if there is an arrival at time $n$, and $Y_n = 0$ if there is no arrival.

We have $P(Y_n \neq 0) = 1/2^k$, if $n \in I_k$. Note that as $n$ increases, it belongs to
intervals $I_k$ with increasingly large indices $k$. Consequently,

$$\lim_{n\to\infty} P(Y_n \neq 0) = \lim_{k\to\infty} \frac{1}{2^k} = 0,$$

and we conclude that $Y_n$ converges to 0 in probability. However, when we carry out
the experiment, the total number of arrivals is infinite (one arrival during each
interval $I_k$). Therefore, $Y_n$ is unity for infinitely many values of $n$, the event
$\{\lim_{n\to\infty} Y_n = 0\}$ has zero probability, and we do not have convergence with
probability 1.

Intuitively, the following is happening. At any given time, there is a small 
(and diminishing with n) probability of a substantial deviation from 0 (convergence 
in probability). On the other hand, given enough time, a substantial deviation 
from 0 is certain to occur, and for this reason, we do not have convergence with 
probability 1. 
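
The contrast in Example 7.13 can be seen in a short simulation. This is a sketch under stated assumptions rather than anything from the original notes: the interval index $k$ is assumed to start at 0, and `simulate_arrivals` and `prob_nonzero` are hypothetical helper names.

```python
import random


def simulate_arrivals(num_intervals, seed=0):
    """One arrival time per interval I_k = {2**k, ..., 2**(k+1) - 1}, k = 0, 1, ...

    Each arrival is uniform over its interval, independently across intervals.
    Starting k at 0 is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [rng.randrange(2**k, 2 ** (k + 1)) for k in range(num_intervals)]


def prob_nonzero(n):
    """P(Y_n != 0) = 1 / 2**k, where I_k is the interval that contains n >= 1."""
    k = n.bit_length() - 1  # 2**k <= n < 2**(k+1)
    return 1.0 / 2**k


if __name__ == "__main__":
    arrivals = simulate_arrivals(20)
    # Y_n = 1 exactly at these times: one per interval, so they never stop.
    print(arrivals)
    # The per-time probability of an arrival still shrinks to zero.
    print([prob_nonzero(n) for n in (1, 10, 100, 1000, 10**6)])
```

Every simulated run contains one arrival in each interval, so the value 1 keeps recurring along the sequence, while the probability of seeing it at any fixed time $n$ shrinks toward zero.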


Example 7.14. Let $X_1, X_2, \ldots$ be a sequence of independent random variables
that are uniformly distributed on $[0, 1]$, and let $Y_n = \min\{X_1, \ldots, X_n\}$. We wish
to show that $Y_n$ converges to 0, with probability 1.

In any execution of the experiment, the sequence $Y_n$ is nonincreasing, i.e.,
$Y_{n+1} \leq Y_n$ for all $n$. Since this sequence is bounded below by zero, it must have a
limit, which we denote by $Y$. Let us fix some $\epsilon > 0$. If $Y \geq \epsilon$, then $X_i \geq \epsilon$ for all $i$,
which implies that

$$P(Y \geq \epsilon) \leq P(X_1 \geq \epsilon, \ldots, X_n \geq \epsilon) = (1 - \epsilon)^n.$$

Since this is true for all $n$, we must have

$$P(Y \geq \epsilon) \leq \lim_{n\to\infty} (1 - \epsilon)^n = 0.$$

This shows that $P(Y \geq \epsilon) = 0$, for any positive $\epsilon$. We conclude that $P(Y > 0) = 0$,
which implies that $P(Y = 0) = 1$. Since $Y$ is the limit of $Y_n$, we see that $Y_n$
converges to zero with probability 1.
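
The argument of Example 7.14 can be illustrated numerically. The sketch below is not from the original notes, and the names `running_min_path` and `tail_bound` are hypothetical. It generates one path of the running minimum and evaluates the bound $(1 - \epsilon)^n$ used in the proof, assuming $0 < \epsilon \leq 1$.

```python
import random


def running_min_path(n_steps, seed=0):
    """Return [Y_1, ..., Y_n], where Y_n = min(X_1, ..., X_n) for i.i.d. Uniform(0, 1) X_i."""
    rng = random.Random(seed)
    current = float("inf")
    path = []
    for _ in range(n_steps):
        current = min(current, rng.random())
        path.append(current)
    return path


def tail_bound(eps, n):
    """(1 - eps)**n, the probability that X_1, ..., X_n are all >= eps (0 < eps <= 1)."""
    return (1.0 - eps) ** n


if __name__ == "__main__":
    path = running_min_path(10_000)
    # The path never increases, so it has a limit on every run.
    print(path[0], path[99], path[-1])
    # The bound on P(Y >= eps) vanishes geometrically in n.
    print([tail_bound(0.1, n) for n in (10, 100, 1000)])
```

The path is nonincreasing by construction, which mirrors the monotonicity step of the proof, and the printed bounds show how quickly $(1 - \epsilon)^n$ decays for a fixed $\epsilon$.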


